On 04/04/2017 01:52 PM, Heikki Linnakangas wrote:
On 03/31/2017 10:10 AM, Michael Paquier wrote:On Wed, Mar 8, 2017 at 10:39 PM, Robert Haas <robertmh...@gmail.com> wrote:On Tue, Mar 7, 2017 at 10:01 PM, Michael Paquier <michael.paqu...@gmail.com> wrote: I kinda hope Heikki is going to step up to the plate here, because I think he understands most of it already.The second person who knows a good deal about NFKC is Ishii-san. Attached is a rebased patch.Thanks, I'm looking at this now.
I will continue tomorrow, but I wanted to report on what I've done so far. Attached is a new patch version, quite heavily modified. Notable changes so far:
* Use Unicode codepoints, rather than UTF-8 bytes packed in a 32-bit ints. IMHO this makes the tables easier to read (to a human), and they are also packed slightly more tightly (see next two points), as you can fit more codepoints in a 16-bit integer.
* Reorganize the decomposition table. Instead of a separate binary-searched table for each "size", have one table that's ordered by codepoint, which contains a length and offset into another array, where all the decomposed sequences are. This makes the recomposition step somewhat slower, as it now has to grovel through an array that contains all decompositions, rather than just the array that contains decompositions of two characters. But this isn't performance-critical, readability and tight packing of the tables matter more.
* In the lookup table, store singleton decompositions that decompose to a single character, and the character fits in a uint16, directly in the main lookup table, instead of storing an index into the other table. This makes the tables somewhat smaller, as there are a lot of such decompositions.
* Use uint32 instead of pg_wchar to represent unicode codepoints. pg_wchar suggests something that holds multi-byte characters in the OS locale, used by functions like wcscmp, so avoid that. I added a "typedef uint32 Codepoint" alias, though, to make it more clear which variables hold Unicode codepoints rather than e.g. indexes into arrays.
* Unicode consortium publishes a comprehensive normalization test suite, as a text file, NormalizationTest.txt. I wrote a small perl and C program to run all the tests from that test suite, with the normalization function. That uncovered a number of bugs in the recomposition algorithm, which I then fixed:
* decompositions that map to ascii characters cannot be excluded.* non-canonical and non-starter decompositions need to be excluded. They are now flagged in the lookup table, so that we only use them during the decomposition phase, not when recomposing.
TODOs / Discussion points:* I'm not sure I like the way the code is organized, I feel that we're littering src/common with these. Perhaps we should add a src/common/unicode_normalization directory for all this.
* The list of characters excluded from recomposition is currently hard-coded in utf_norm_generate.pl. However, that list is available in machine-readable format, in file CompositionExclusions.txt. Since we're reading most of the data from UnicodeData.txt, would be good to read the exclusion table from a file, too.
* SASLPrep specifies normalization form KC, but it also specifies that some characters are mapped to space or nothing. Should do those mappings, too.
-- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers