On 02/09/2017 09:33 AM, Michael Paquier wrote:
On Tue, Feb 7, 2017 at 11:28 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
Yes, I am actively working on this one now. I am first trying to shape
it as an extension, and then get a patch out of it; that will be
simpler for testing. For now, the work that really remains in the
patches attached to this thread is the internal conversion itself, as
the UTF8-related routines needed to work on the strings are already
present in scram-common.c.


It took me a couple of days... Attached is a prototype implementing
SASLprep() (or NFKC, if you want) for UTF-8 strings.

Cool!

Now, using this module, I have arrived at the following conclusions on
how to keep the size of the conversion tables to a minimum without
much impact on lookup performance:
- Out of 30k characters in total, 24k have a combining class of 0 and
no decomposition; those need to be dropped from the conversion table.
- Most characters have a decomposition of one or two characters, though
one has a decomposition of 18 characters. So we need to create two
sets of conversion tables:
-- A base table, with the character number (4 bytes), the combining
class (1 byte) and the size of the decomposition (1 byte).
-- A set of decomposition tables, one per decomposition size.
So when decomposing a character, we first look up the size of its
decomposition in the base table, then fetch the decomposed characters
from the table for that size.
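For illustration, here is a rough sketch of what those tables could
look like (all names are invented here, the decomp_offset field is an
assumption of this sketch, and the real tables would be generated from
UnicodeData.txt):

#include <stddef.h>
#include <stdint.h>

/* Base table entry: 4-byte code point, 1-byte combining class and
 * 1-byte decomposition size.  decomp_offset indexes into the per-size
 * decomposition array below. */
typedef struct
{
    uint32_t    codepoint;
    uint8_t     comb_class;
    uint8_t     decomp_size;    /* 0 if no decomposition */
    uint16_t    decomp_offset;
} unicode_decomp_entry;

/* Base table, sorted by code point.  Characters with a combining class
 * of 0 and no decomposition are simply absent, which is how the 24k
 * characters mentioned above cost no table space.  Two toy entries
 * shown. */
static const unicode_decomp_entry base_table[] = {
    {0x00C0, 0, 2, 0},      /* A WITH GRAVE -> A + combining grave */
    {0x0300, 230, 0, 0}     /* COMBINING GRAVE ACCENT, class 230 */
};

/* One flat array per decomposition size; the size-2 decomposition of
 * entry e starts at decomp_tab2[2 * e->decomp_offset]. */
static const uint32_t decomp_tab2[] = {
    0x0041, 0x0300          /* decomposition of U+00C0 */
};

/* Binary search on the base table; NULL means "combining class 0 and
 * no decomposition". */
static const unicode_decomp_entry *
lookup_entry(uint32_t cp)
{
    size_t      lo = 0;
    size_t      hi = sizeof(base_table) / sizeof(base_table[0]);

    while (lo < hi)
    {
        size_t      mid = (lo + hi) / 2;

        if (base_table[mid].codepoint == cp)
            return &base_table[mid];
        if (base_table[mid].codepoint < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}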

Sounds good.

Now, regarding the shape of the implementation for SCRAM, we need one
thing: a set of routines in src/common/ to build the decomposition of
a given UTF-8 string, covering the conversion between UTF-8 strings
and pg_wchar arrays, the decomposition itself, and the reordering of
combining characters. The attached extension roughly implements that.
We could also have a module in contrib/ that does NFK[C|D] using the
base APIs in src/common/. By manipulating the characters as arrays of
pg_wchar (integers), we can validate the result with a set of
regression tests that do *not* have to print non-ASCII characters.
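To illustrate the reordering part, here is a rough sketch (reusing the
toy base_table and lookup_entry() from the sketch above; pg_wchar is a
stand-in typedef for the real type, and get_comb_class() is invented
for this sketch):

typedef uint32_t pg_wchar;  /* stand-in for the real pg_wchar */

/* Combining class of a code point, via the base table sketched above;
 * code points absent from the table have class 0. */
static uint8_t
get_comb_class(pg_wchar cp)
{
    const unicode_decomp_entry *e = lookup_entry(cp);

    return e ? e->comb_class : 0;
}

/* Canonical reordering: stable-sort each run of characters with a
 * nonzero combining class.  An insertion sort is enough since such
 * runs are short, and characters of class 0 act as barriers that
 * never move. */
static void
canonical_reorder(pg_wchar *chars, size_t nchars)
{
    for (size_t i = 1; i < nchars; i++)
    {
        uint8_t     cc = get_comb_class(chars[i]);

        if (cc == 0)
            continue;
        for (size_t j = i;
             j > 0 && get_comb_class(chars[j - 1]) > cc;
             j--)
        {
            pg_wchar    tmp = chars[j - 1];

            chars[j - 1] = chars[j];
            chars[j] = tmp;
        }
    }
}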

A contrib module or built-in extra functions to deal with Unicode characters might be handy for a lot of things. But I'd leave that out for now, to keep this patch minimal.

- Heikki


