On Thu, Dec 15, 2016 at 3:17 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> In the case where the binaries are *not* built with libidn, I think
> that we had better reject valid UTF-8 string directly and just allow
> ASCII? SASLprep is a no-op on ASCII characters.
> Thoughts about this approach?

And Heikki has mentioned to me that he'd prefer not to have an extra
dependency for the normalization; libidn is LGPL-licensed, by the way.
So I have looked at the SASLprep business to see what should be done
to get a complete implementation in core, independent of any external
library.

The first thing is to be able to determine in the SCRAM code whether a
string is valid UTF-8 or not, and this code is in src/common/.
pg_wchar.c offers a set of routines exactly for this purpose, but it
is built with libpq and not available to src/common/. So instead of
moving the whole file, I'd like to create a new file src/common/utf8.c
containing pg_utf_mblen() and pg_utf8_islegal(). On top of that, I
think that a routine able to check a full string would be useful to
many callers, as pg_utf8_islegal() can only check one multibyte
character at a time. If the password string is found to be valid
UTF-8, SASLprep is applied. If not, the string is copied as-is, with
perhaps unexpected effects for the client, but the client is in
trouble already if it is not using UTF-8.
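
For illustration, the whole-string check could look like the sketch
below (in Python for brevity; the real code would be C reusing
pg_utf_mblen() and pg_utf8_islegal(), and the function names here are
made up for the example):

```python
def utf8_mblen(first_byte: int) -> int:
    """Expected sequence length from the first byte (mirrors pg_utf_mblen())."""
    if first_byte < 0x80:
        return 1
    if first_byte >= 0xF0:
        return 4
    if first_byte >= 0xE0:
        return 3
    if first_byte >= 0xC0:
        return 2
    return 1  # stray continuation byte; rejected by the legality check below

def utf8_verify_string(b: bytes) -> bool:
    """Check that a whole byte string is legal UTF-8, one character at a time."""
    i = 0
    while i < len(b):
        n = utf8_mblen(b[i])
        chunk = b[i:i + n]
        if len(chunk) < n:
            return False            # truncated multibyte sequence at end of string
        try:
            chunk.decode('utf-8')   # stands in for pg_utf8_islegal()
        except UnicodeDecodeError:
            return False            # overlong encoding, bad continuation byte, etc.
        i += n
    return True
```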

Then comes the real business... Note that this is my first time
touching encodings, particularly UTF-8, in depth, so please be nice. I
may write things from here on that are incorrect or that sound so :)

The second thing is the normalization itself. Per RFC 4013, NFKC needs
to be applied to the string.  The operation is described completely in
[1], and consists of 1) a compatibility decomposition of the string,
followed by 2) a canonical composition.
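
As a sanity check, Python's unicodedata module can serve as an oracle
for what NFKC is supposed to produce (not something usable in core, of
course, but handy for building test cases):

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI: compatibility decomposition to "fi"
assert unicodedata.normalize('NFKC', '\ufb01') == 'fi'

# "e" + COMBINING ACUTE ACCENT: canonical composition back to U+00E9
assert unicodedata.normalize('NFKC', 'e\u0301') == '\u00e9'
```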

About 1). The compatibility decomposition is defined in [2], "by
recursively applying the canonical and compatibility mappings, then
applying the canonical reordering algorithm". The canonical and
compatibility mappings are data available in UnicodeData.txt, the
6th column of the records defined in [3] to be precise. The meaning of
the decomposition mappings is defined in [2] as well. The canonical
decomposition basically consists of looking up a given character and
replacing it, recursively, with the sequence of characters its mapping
defines. The compatibility mappings should be applied as well, but
[5], a perl tool called charlint.pl that does this normalization work,
does not care about this phase... Do we?
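
To illustrate the difference between the two kinds of mappings, here
is what that 6th field of UnicodeData.txt contains for a canonical and
a compatibility case, as exposed by Python's unicodedata (a "<compat>"
tag marks a compatibility mapping; an untagged field is canonical):

```python
import unicodedata

# Canonical mapping: U+00E9 (é) decomposes to U+0065 U+0301 (e + acute)
assert unicodedata.decomposition('\u00e9') == '0065 0301'

# Compatibility mapping: U+FB01 (fi ligature) carries the <compat> tag
assert unicodedata.decomposition('\ufb01') == '<compat> 0066 0069'
```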

About 2)... Once the decomposition has been applied, those characters
need to be recomposed using the Canonical_Combining_Class field of
UnicodeData.txt in [3], which is the 4th column of the records. Its
values are defined in [4]. Another interesting thing: charlint.pl [5]
does not care about this phase either. I am wondering whether we
should just drop this part as well...
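
For what it's worth, here is a small illustration of what the
combining classes drive: the canonical reordering step sorts adjacent
combining marks by Canonical_Combining_Class, which is why skipping it
would give different results for canonically equivalent inputs:

```python
import unicodedata

# COMBINING DOT BELOW has class 220, COMBINING ACUTE ACCENT has class 230
assert unicodedata.combining('\u0323') == 220
assert unicodedata.combining('\u0301') == 230

# Canonical reordering puts the lower class first, whatever the input order
s = 'e\u0301\u0323'   # acute (230) written before dot below (220)
assert unicodedata.normalize('NFD', s) == 'e\u0323\u0301'
```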

Once 1) and 2) are done, NFKC is complete, and so is SASLprep.

So what we need on the Postgres side is a mapping table with the
following fields:
1) The code point of the character, in hexadecimal.
2) Its canonical combining class.
3) The kind of decomposition mapping (canonical or compatibility), if defined.
4) The decomposition mapping itself, in hexadecimal format.
Based on what I looked at, either perl or python could be used to
process UnicodeData.txt and generate a header file that would be
included in the tree. There are 30k entries in UnicodeData.txt, 5k of
which have a mapping, so that will result in fairly large tables. One
way to get decent performance would be to store the length of the
table in a static variable, order the entries by their code points,
and do a binary search to find an entry. We could also use fancier
structures, like a set of tables forming a radix tree keyed on the
decomposed bytes. Either way we end up doing just one table lookup per
character.
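
A sketch of what the generation and lookup could look like, using a
few hand-picked UnicodeData.txt records as input (the real script
would read the whole file; the function names are made up for the
example, and field numbers are 0-based here: field 0 is the code
point, field 3 the combining class, field 5 the decomposition
mapping):

```python
from bisect import bisect_left

# A few sample records, verbatim from UnicodeData.txt
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
"""

def parse_unicode_data(text):
    """Build a table sorted by code point: (cp, ccc, kind, mapping)."""
    table = []
    for line in text.splitlines():
        fields = line.split(';')
        cp, ccc, decomp = int(fields[0], 16), int(fields[3]), fields[5]
        kind, mapping = 'none', []
        if decomp:
            parts = decomp.split()
            if parts[0].startswith('<'):
                # tagged mapping, e.g. "<compat> 0066 0069"
                kind, parts = parts[0].strip('<>'), parts[1:]
            else:
                kind = 'canonical'
            mapping = [int(p, 16) for p in parts]
        table.append((cp, ccc, kind, mapping))
    table.sort()
    return table

def lookup(table, cp):
    """Binary search on the sorted code points, one probe per character."""
    i = bisect_left(table, (cp,))
    if i < len(table) and table[i][0] == cp:
        return table[i]
    return None
```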

In conclusion, at this point I am looking for feedback regarding the
following items:
1) Where to put the UTF8 check routines and what to move.
2) How to generate the mapping table from UnicodeData.txt. I'd think
that using perl would be better.
3) The shape of the mapping table, which depends on how many
operations we want to support in normalizing the strings.
The decisions on those items will drive the implementation in one
direction or another.

[1]: http://www.unicode.org/reports/tr15/#Description_Norm
[3]: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
[5]: https://www.w3.org/International/charlint/

Heikki, others, thoughts?

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)