On 04/05/2017 07:23 AM, Michael Paquier wrote:
> On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> I will continue tomorrow, but I wanted to report on what I've done so far.
>> Attached is a new patch version, quite heavily modified. Notable changes so far:
>
> Great, thanks!
>
>> * Use Unicode codepoints, rather than UTF-8 bytes packed into 32-bit ints. IMHO this makes the tables easier to read (to a human), and they are also packed slightly more tightly (see next two points), as you can fit more codepoints in a 16-bit integer.
>
> Using codepoints directly is not very consistent with the existing backend, but for the sake of packing things more tightly, OK.

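To make the size argument in the first bullet concrete, here is a minimal C sketch comparing a raw codepoint with the same character's UTF-8 bytes packed into one integer. The function names (codepoint_to_utf8, pack_utf8, fits_in_16_bits) are hypothetical and are not taken from the patch or from wchar.c, and the most-significant-byte-first packing order is an assumption. The encoder does not reject surrogates.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode one Unicode codepoint as UTF-8 into out[] and return the number of
 * bytes written (1 to 4). Hypothetical helper, written for illustration only.
 */
static int
codepoint_to_utf8(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0x10FFFF);

    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Pack the UTF-8 bytes of a codepoint into a single 32-bit integer, first
 * byte in the most significant position (assumed packing order).
 */
static uint32_t
pack_utf8(uint32_t cp)
{
    unsigned char buf[4];
    int         len = codepoint_to_utf8(cp, buf);
    uint32_t    packed = 0;

    for (int i = 0; i < len; i++)
        packed = (packed << 8) | buf[i];
    return packed;
}

/* True if the value can be stored in a 16-bit table entry. */
static int
fits_in_16_bits(uint32_t value)
{
    return value <= 0xFFFF;
}
```

For U+20AC (EURO SIGN), the codepoint itself fits in 16 bits, while the packed UTF-8 form needs three bytes, so a 16-bit table entry can hold the former but not the latter.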
Oh, I see, we already have similar functions in wchar.c. unicode_to_utf8() and utf8_to_unicode(). We should probably move those to src/common, rather than re-invent the wheel.
pg_utf8_islegal() and pg_utf_mblen() should also be moved into their own file, I think, and wchar.c can use that.
>> * The list of characters excluded from recomposition is currently hard-coded in utf_norm_generate.pl. However, that list is available in machine-readable format, in file CompositionExclusions.txt. Since we're reading most of the data from UnicodeData.txt, would be good to read the exclusion table from a file, too.
>
> Ouch. Those are present here... http://www.unicode.org/reports/tr41/tr41-19.html#Exclusions
> Definitely it makes more sense to read them from a file.

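As a rough illustration of reading the exclusion list from the file, here is a C sketch of a per-line parser. It assumes each data line of CompositionExclusions.txt starts with a hexadecimal codepoint, optionally followed by a "#" comment, and that other lines are blank or start with "#". The function name parse_exclusion_line is hypothetical; the actual generator discussed here is a Perl script, so this only shows the idea.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse one line of an exclusion file. Returns 1 and stores the codepoint in
 * *cp if the line begins with a hexadecimal codepoint; returns 0 for blank
 * lines, comment lines, and anything else that does not start with a hex
 * digit. Hypothetical helper, for illustration only.
 */
static int
parse_exclusion_line(const char *line, uint32_t *cp)
{
    char          *end;
    unsigned long  value;

    assert(line != NULL && cp != NULL);

    /* Skip leading whitespace. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Comment lines start with '#'; blank lines have nothing to parse. */
    if (!isxdigit((unsigned char) *line))
        return 0;

    value = strtoul(line, &end, 16);
    if (end == line || value > 0x10FFFF)
        return 0;

    *cp = (uint32_t) value;
    return 1;
}
```

A caller would loop over the file with fgets(), pass each line to the parser, and collect the returned codepoints into the exclusion table.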
>> * SASLPrep specifies normalization form KC, but it also specifies that some characters are mapped to space or nothing. Should do those mappings, too.
>
> Ah, right. Those ones are here: https://tools.ietf.org/html/rfc3454#appendix-B.1

Yep. Attached is a new version. Notable changes since yesterday:

* Implemented the rest of the SASLPrep, mapping some characters to spaces, leaving out others, and checking for prohibited characters and bidirectional strings.
* Moved things around. There's now a separate directory, src/common/unicode, which contains the perl scripts and the test code. Those are not needed to build from source, as the pre-generated tables are put in src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode, really.
* Renamed many things from utf_* to unicode_*, since they don't deal with utf-8 input anymore.
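
The mapping step from the first bullet above could look roughly like the following C sketch, which works on an array of codepoints. It covers only a hand-picked subset of the RFC 3454 tables (some non-ASCII spaces mapped to U+0020, some "commonly mapped to nothing" characters dropped), not the complete lists, and it leaves out normalization and the prohibited-character and bidirectional checks. The names MAP_TO_NOTHING, map_codepoint and saslprep_map are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel meaning "drop this character"; not a valid codepoint. */
#define MAP_TO_NOTHING 0xFFFFFFFFu

/*
 * Map a single codepoint. Only an illustrative subset of the RFC 3454
 * tables is handled here.
 */
static uint32_t
map_codepoint(uint32_t cp)
{
    /* Subset of the non-ASCII space characters: map to U+0020. */
    if (cp == 0x00A0 || cp == 0x1680 || cp == 0x202F ||
        cp == 0x205F || cp == 0x3000)
        return 0x0020;
    if (cp >= 0x2000 && cp <= 0x200A)
        return 0x0020;

    /* Subset of the "commonly mapped to nothing" characters: drop. */
    if (cp == 0x00AD || cp == 0x034F || cp == 0x1806 ||
        cp == 0x2060 || cp == 0xFEFF)
        return MAP_TO_NOTHING;
    if ((cp >= 0x180B && cp <= 0x180D) ||
        (cp >= 0x200C && cp <= 0x200D) ||
        (cp >= 0xFE00 && cp <= 0xFE0F))
        return MAP_TO_NOTHING;

    return cp;
}

/*
 * Apply the mapping to n input codepoints. out[] must have room for n
 * entries. Returns the number of codepoints written.
 */
static size_t
saslprep_map(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t      written = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
    {
        uint32_t    mapped = map_codepoint(in[i]);

        if (mapped != MAP_TO_NOTHING)
            out[written++] = mapped;
    }
    return written;
}
```

With the input 'a', U+00AD, U+00A0, 'b', U+200D, the soft hyphen and the zero-width joiner are dropped and the no-break space becomes an ASCII space, leaving three codepoints.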
This is starting to shape up, but there is still some cleanup work to do. I will continue tomorrow.
--
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers