Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol

Michael Paquier Fri, 03 Feb 2017 15:02:07 -0800

On Fri, Feb 3, 2017 at 9:52 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 12/20/2016 03:47 AM, Michael Paquier wrote:
>>
>> The first thing is to be able to understand in the SCRAM code if a
>> string is UTF-8 or not, and this code is in src/common/. pg_wchar.c
>> offers a set of routines exactly for this purpose, which is built with
>> libpq but that's not available for src/common/. So instead of moving
>> all the file, I'd like to create a new file in src/common/utf8.c which
>> includes pg_utf_mblen() and pg_utf8_islegal().
>
> Sounds reasonable. They're short functions, might also be ok to just
> copy-paste them to scram-common.c.


Having a separate file makes the most sense to me I think, if we can
avoid code duplication that's better.

>> The second thing is the normalization itself. Per RFC4013, NFKC needs
>> to be applied to the string.  The operation is described in [1]
>> completely, and it is named as doing 1) a compatibility decomposition
>> of the bytes of the string, followed by 2) a canonical composition.
>>
>> About 1). The compatibility decomposition is defined in [2], "by
>> recursively applying the canonical and compatibility mappings, then
>> applying the canonical reordering algorithm". Canonical and
>> compatibility mapping are some data available in UnicodeData.txt, the
>> 6th column of the set defined in [3] to be precise. The meaning of the
>> decomposition mappings is defined in [2] as well. The canonical
>> decomposition is basically to look for a given UTF-8 character, and
>> then apply the multiple characters resulting in its new shape. The
>> compatibility mapping should as well be applied, but [5], a perl tool
>> called charlint.pl doing this normalization work, does not care about
>
> Not sure. We need to do whatever the "right thing" is, according to the RFC.
> I would assume that the spec is not ambiguous this, but I haven't looked
> into the details. If it's ambiguous, then I think we need to look at some
> popular implementations to see what they do.

The spec defines quite correctly what should be done. The
implementations are sometimes quite loose on some points though (see
charlint.pl).

>> So what we need from Postgres side is a mapping table to, having the
>> following fields:
>> 1) Hexa sequence of UTF8 character.
>> 2) Its canonical combining class.
>> 3) The kind of decomposition mapping if defined.
>> 4) The decomposition mapping, in hexadecimal format.
>> Based on what I looked at, either perl or python could be used to
>> process UnicodeData.txt and to generate a header file that would be
>> included in the tree. There are 30k entries in UnicodeData.txt, 5k of
>> them have a mapping, so that will result in many tables. One thing to
>> improve performance would be to store the length of the table in a
>> static variable, order the entries by their hexadecimal keys and do a
>> dichotomy lookup to find an entry. We could as well use more fancy
>> things like a set of tables using a Radix tree using decomposed by
>> bytes. We should finish by just doing one lookup of the table for each
>> character sets anyway.
>
> Ok. I'm not too worried about the performance of this. It's only used for
> passwords, which are not that long, and it's only done when connecting. I'm
> more worried about the disk/memory usage. How small can we pack the tables?
> 10kB? 100kB? Even a few MB would probably not be too bad in practice, but
> I'd hate to bloat up libpq just for this.

Indeed. I think I'll develop first a small utility able to do
operation. There is likely some knowledge in mb/Unicode that we can
use here. The radix tree patch would perhaps help?

>> 3) The shape of the mapping table, which depends on how many
>> operations we want to support in the normalization of the strings.
>> The decisions for those items will drive the implementation in one
>> sense or another.
>
> Let's aim for small disk/memory footprint.

OK, I'll try to give it a shot in a couple of days in the shape of an
extention or something like that. Thanks for the feedback.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol

Reply via email to