On Tue, Feb 7, 2017 at 11:28 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> Yes, I am actively working on this one now. I am trying to come up
> first with something in the shape of an extension to begin with, and
> get a patch out of it. That will be more simple for testing. For now
> the work that really remains in the patches attached on this thread is
> to get the internal work done, all the UTF8-related routines being
> already present in scram-common.c to work on the strings.

It took me a couple of days... And attached is the prototype
implementing SASLprep(), or NFKC if you want for UTF-8 strings. This
extension is pretty useful to check the validity of any normalization
forms defined in the Unicode specs, and is as well on my github:

In short, at build what this extension does is fetching
UnicodeData.txt, and it builds a conversion table including the fields
necessary for NFKC:
- The utf code of a character.
- The combining class of the character.
- The decomposition sets of characters.
I am aware of the fact that the implemention of this extension is the
worst possible, there are many bytes wasted, and the resulting shared
library gets at 2.4MB. Now as an example of how normalization forms
work that's a great study, and with that I am now aware of what needs
to be done to reduce the size of the conversion tables.

This extension has two conversion functions for UTF-8 string <=>
integer array (as UTF-8 characters are on 4 bytes with pg_wchar):
=# SELECT array_to_utf8('{50064}');
(1 row)
=# SELECT utf8_to_array('ÐÐÐ');
(1 row)

Then using an integer array in input SASLprep() can be done using
pg_sasl_prepare(), which returns to caller a decomposed recursively
set, with reordering done using the combining class integer array from
the conversion table generated wiht UnicodeData.txt. Lookups at the
conversion tables are done using bsearch(), so that's, at least I
guess, fast.

I have implemented as well a function to query the whole conversion as
a SRF (character number, combining class and decomposition), which is
useful for analysis:
=# select count(*) from utf8_conv_table();
(1 row)

Now using this module I have arrived to the following conclusions to
put to a minimum the size of the conversion tables, without much
impacting lookup performance:
- There are 24k characters with a combining class of 0 and no
decomposition for a total of 30k characters, those need to be dropped
from the conversion table.
- Most characters have a single, or double decomposition, one has a
decomposition of 18 characters. So we need to create two sets of
conversion tables:
-- A base table, with the character number (4 bytes), the combining
class (1 byte) and the size of the decomposition (1 byte).
-- A set of decomposition tables, classified by size.
So when decomposing a character, we check first the size of the
decomposition, then get the set from the correct table.

Now regarding the shape of the implementation for SCRAM, we need one
thing: a set of routines in src/common/ to build decompositions for a
given UTF-8 string with conversion UTF8 string <=> pg_wchar array, the
decomposition and the reordering. The extension attached roughly
implements that. What we can actually do as well is have in contrib/ a
module that does NFK[C|D] using the base APIs in src/common/. Using
arrays of pg_wchar (integers) to manipulate the characters, we can
validate and have a set of regression tests that do *not* have to
print non-ASCII characters. Let me know if this plan looks good, now I
think that I have enough material to get SASLprep done cleanly, with a
minimum memory footprint.

Heikki, others, how does that sound for you?

Attachment: pg_sasl_prepare.tar.gz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to