On 04/05/2017 07:23 AM, Michael Paquier wrote:

On Wed, Apr 5, 2017 at 7:05 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
I will continue tomorrow, but I wanted to report on what I've done so far.
Attached is a new patch version, quite heavily modified. Notable changes so

Great, thanks!

* Use Unicode codepoints, rather than UTF-8 bytes packed in 32-bit ints.
IMHO this makes the tables easier to read (to a human), and they are also
packed slightly more tightly (see next two points), as you can fit more
codepoints in a 16-bit integer.
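
To illustrate the packing argument, here is a minimal Python sketch (not code from the patch; the helper names packed_utf8 and fits_in_16_bits are made up for this example). It compares a raw codepoint with the same character's UTF-8 bytes packed big-endian into one integer:

```python
def packed_utf8(cp: int) -> int:
    """Pack the UTF-8 bytes of a codepoint into one integer, big-endian."""
    return int.from_bytes(chr(cp).encode("utf-8"), "big")


def fits_in_16_bits(value: int) -> bool:
    """True if the value can be stored in an unsigned 16-bit integer."""
    return 0 <= value <= 0xFFFF


# U+00E9 (e with acute) and U+20AC (euro sign) are both BMP characters.
for cp in (0x00E9, 0x20AC):
    packed = packed_utf8(cp)
    print(
        f"U+{cp:04X}: codepoint fits in 16 bits: {fits_in_16_bits(cp)}, "
        f"packed UTF-8 {packed:#x} fits in 16 bits: {fits_in_16_bits(packed)}"
    )
```

Every BMP codepoint fits in 16 bits, while any character at or above U+0800 takes three UTF-8 bytes and so no longer does once packed.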

Using codepoints directly is not very consistent with the existing
backend, but for the sake of tighter packing, OK.

Oh, I see, we already have similar functions in wchar.c: unicode_to_utf8() and utf8_to_unicode(). We should probably move those to src/common, rather than reinvent the wheel.
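
For readers unfamiliar with what such a conversion pair does, here is a rough Python sketch of the codepoint <-> UTF-8 mapping. It is not the wchar.c code; the names codepoint_to_utf8 and utf8_to_codepoint are hypothetical, and the decoder deliberately skips validation of continuation bytes and overlong forms:

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode codepoint as 1 to 4 UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("codepoint out of range")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_to_codepoint(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data (no validity checks)."""
    first = data[0]
    if first < 0x80:
        return first
    if (first & 0xE0) == 0xC0:
        trailing, cp = 1, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        trailing, cp = 2, first & 0x0F
    elif (first & 0xF8) == 0xF0:
        trailing, cp = 3, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(data) < 1 + trailing:
        raise ValueError("truncated UTF-8 sequence")
    # Each continuation byte contributes its low 6 bits.
    for byte in data[1:1 + trailing]:
        cp = (cp << 6) | (byte & 0x3F)
    return cp
```

A validity check in the spirit of pg_utf8_islegal() would have to run before the decoder in real use; the sketch assumes well-formed input.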

pg_utf8_islegal() and pg_utf_mblen() should also be moved into their
own file, I think, and wchar.c can use that.


* The list of characters excluded from recomposition is currently hard-coded
in utf_norm_generate.pl. However, that list is available in machine-readable
format, in file CompositionExclusions.txt. Since we're reading most of the
data from UnicodeData.txt, it would be good to read the exclusion table from a
file, too.

Ouch. Those are present here...
Definitely it makes more sense to read them from a file.

Did that.
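
Reading the exclusion list from the file could look roughly like the following Python sketch (the actual generator is a Perl script; this is only an illustration). SAMPLE is a short, abbreviated excerpt modelled on the layout of CompositionExclusions.txt, and parse_exclusions is a hypothetical name. The sketch assumes each data line starts with a single hex codepoint and that everything after '#' is a comment:

```python
SAMPLE = """\
# (1) Script Specifics
0958    #  DEVANAGARI LETTER QA
0959    #  DEVANAGARI LETTER KHHA

# (3) Singleton Decompositions (commented out in the file)
# 0340..0341       [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
"""


def parse_exclusions(text: str) -> set[int]:
    """Collect the codepoints listed on non-comment lines."""
    excluded = set()
    for line in text.splitlines():
        # Drop the trailing comment, then surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        excluded.add(int(data.split()[0], 16))
    return excluded


print(sorted(f"U+{cp:04X}" for cp in parse_exclusions(SAMPLE)))
```

Entries that are commented out in the file are skipped by this approach, so a generator that needs them would have to derive them from the decomposition data instead.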

* SASLPrep specifies normalization form KC, but it also specifies that some
characters are mapped to space or nothing. Should do those mappings, too.

Ah, right. Those ones are here:


Attached is a new version. Notable changes since yesterday:

* Implemented the rest of the SASLPrep, mapping some characters to spaces, leaving out others, and checking for prohibited characters and bidirectional strings.
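
As a rough illustration of those steps (map, normalize to NFKC, check for prohibited characters), here is a Python sketch. It is not the patch's C code: the tables are tiny hand-picked subsets of the RFC 3454 tables that SASLprep references, the bidirectional check is left out entirely, and Python's unicodedata uses whatever Unicode version the interpreter ships with rather than the one the RFC pins. The name saslprep_sketch is made up for this example:

```python
import unicodedata

# Tiny subsets for illustration only; the real tables are much larger.
MAP_TO_SPACE = {0x00A0, 0x2003, 0x3000}    # some non-ASCII space characters
MAP_TO_NOTHING = {0x00AD, 0x200D, 0xFEFF}  # some "mapped to nothing" characters


def is_prohibited(cp: int) -> bool:
    """Subset of the prohibited set: ASCII control characters only."""
    return cp <= 0x1F or cp == 0x7F


def saslprep_sketch(s: str) -> str:
    # Step 1: map characters to space or to nothing.
    mapped = []
    for ch in s:
        cp = ord(ch)
        if cp in MAP_TO_NOTHING:
            continue
        mapped.append(" " if cp in MAP_TO_SPACE else ch)
    # Step 2: normalization form KC.
    normalized = unicodedata.normalize("NFKC", "".join(mapped))
    # Step 3: reject prohibited output (bidi check omitted in this sketch).
    for ch in normalized:
        if is_prohibited(ord(ch)):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    return normalized
```

With these subsets, a soft hyphen (U+00AD) is dropped, an ideographic space (U+3000) becomes an ASCII space, and a BEL character (U+0007) is rejected.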

* Moved things around. There's now a separate directory, src/common/unicode, which contains the perl scripts and the test code. Those are not needed to build from source, as the pre-generated tables are put in src/include/common. Similar to the scripts in src/backend/utils/mb/Unicode, really.

* Renamed many things from utf_* to unicode_*, since they don't deal with utf-8 input anymore.

This is starting to shape up, but there is still some cleanup work to do. I will continue tomorrow.

- Heikki

Attachment: implement-SASLprep-3.patch.gz
Description: application/gzip
