On Fri, Feb 3, 2017 at 9:52 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 12/20/2016 03:47 AM, Michael Paquier wrote:
>>
>> The first thing is to be able to understand in the SCRAM code if a
>> string is UTF-8 or not, and this code is in src/common/. pg_wchar.c
>> offers a set of routines exactly for this purpose, which is built with
>> libpq but that's not available for src/common/. So instead of moving
>> all the file, I'd like to create a new file in src/common/utf8.c which
>> includes pg_utf_mblen() and pg_utf8_islegal().
>
> Sounds reasonable. They're short functions, might also be ok to just
> copy-paste them to scram-common.c.

Having a separate file makes the most sense to me; if we can avoid
code duplication, that's better.

>> The second thing is the normalization itself. Per RFC4013, NFKC needs
>> to be applied to the string. The operation is described in [1]
>> completely, and it is named as doing 1) a compatibility decomposition
>> of the bytes of the string, followed by 2) a canonical composition.
>>
>> About 1). The compatibility decomposition is defined in [2], "by
>> recursively applying the canonical and compatibility mappings, then
>> applying the canonical reordering algorithm". Canonical and
>> compatibility mapping are some data available in UnicodeData.txt, the
>> 6th column of the set defined in [3] to be precise. The meaning of the
>> decomposition mappings is defined in [2] as well. The canonical
>> decomposition is basically to look for a given UTF-8 character, and
>> then apply the multiple characters resulting in its new shape. The
>> compatibility mapping should as well be applied, but [5], a perl tool
>> called charlint.pl doing this normalization work, does not care about
>
> Not sure. We need to do whatever the "right thing" is, according to the RFC.
> I would assume that the spec is not ambiguous this, but I haven't looked
> into the details. If it's ambiguous, then I think we need to look at some
> popular implementations to see what they do.

The spec defines quite correctly what should be done. Implementations
are sometimes quite loose on some points, though (see charlint.pl).

>> So what we need from Postgres side is a mapping table to, having the
>> following fields:
>> 1) Hexa sequence of UTF8 character.
>> 2) Its canonical combining class.
>> 3) The kind of decomposition mapping if defined.
>> 4) The decomposition mapping, in hexadecimal format.
>> Based on what I looked at, either perl or python could be used to
>> process UnicodeData.txt and to generate a header file that would be
>> included in the tree.
>> There are 30k entries in UnicodeData.txt, 5k of
>> them have a mapping, so that will result in many tables. One thing to
>> improve performance would be to store the length of the table in a
>> static variable, order the entries by their hexadecimal keys and do a
>> dichotomy lookup to find an entry. We could as well use more fancy
>> things like a set of tables using a Radix tree using decomposed by
>> bytes. We should finish by just doing one lookup of the table for each
>> character sets anyway.
>
> Ok. I'm not too worried about the performance of this. It's only used for
> passwords, which are not that long, and it's only done when connecting. I'm
> more worried about the disk/memory usage. How small can we pack the tables?
> 10kB? 100kB? Even a few MB would probably not be too bad in practice, but
> I'd hate to bloat up libpq just for this.

Indeed. I think I'll first develop a small utility able to do this
operation. There is likely some knowledge in mb/Unicode that we can
use here. The radix tree patch would perhaps help?

>> 3) The shape of the mapping table, which depends on how many
>> operations we want to support in the normalization of the strings.
>> The decisions for those items will drive the implementation in one
>> sense or another.
>
> Let's aim for small disk/memory footprint.

OK, I'll try to give it a shot in a couple of days in the shape of an
extension or something like that. Thanks for the feedback.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
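
For illustration, the table layout and the dichotomy (binary search) lookup
discussed in this thread could be sketched in Python as below. This is only a
minimal sketch under stated assumptions, not the PostgreSQL implementation:
SAMPLE holds a handful of lines in the UnicodeData.txt format rather than the
full file, the names build_table, lookup and decompose are hypothetical, and
the canonical reordering and composition steps of NFKC are left out.

```python
from bisect import bisect_left

# A few lines in the UnicodeData.txt format (semicolon-separated fields).
# Field 0 is the code point, field 3 the canonical combining class and
# field 5 (the 6th column) the decomposition mapping, optionally tagged.
SAMPLE = """\
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
"""


def build_table(text):
    """Return (codepoint, combining_class, tag, mapping) tuples sorted by code point.

    Hypothetical helper: entries with combining class 0 and no mapping
    are skipped to keep the table small.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 6:
            continue
        codepoint = int(fields[0], 16)
        combining_class = int(fields[3])
        parts = fields[5].split()
        tag = None
        if parts and parts[0].startswith("<"):
            # A leading <tag> marks a compatibility mapping.
            tag = parts.pop(0).strip("<>")
        mapping = tuple(int(p, 16) for p in parts)
        if combining_class == 0 and not mapping:
            continue
        entries.append((codepoint, combining_class, tag, mapping))
    entries.sort(key=lambda entry: entry[0])
    return entries


def lookup(table, keys, codepoint):
    """Binary search on the sorted keys; return the entry or None."""
    i = bisect_left(keys, codepoint)
    if i < len(keys) and keys[i] == codepoint:
        return table[i]
    return None


def decompose(table, keys, codepoint):
    """Recursively apply canonical and compatibility mappings (no reordering)."""
    entry = lookup(table, keys, codepoint)
    if entry is None or not entry[3]:
        return [codepoint]
    result = []
    for mapped in entry[3]:
        result.extend(decompose(table, keys, mapped))
    return result


if __name__ == "__main__":
    table = build_table(SAMPLE)
    keys = [entry[0] for entry in table]
    # ANGSTROM SIGN maps to U+00C5, which in turn maps to U+0041 U+030A.
    print([hex(c) for c in decompose(table, keys, 0x212B)])
```

Keeping the keys in a separate sorted list mirrors the idea of ordering the
entries by key and storing the table length once; a radix tree would trade
this simplicity for a different memory layout.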