I'm studying the normalization of Unicode so I apologize possible
stupidity in advance.

At Thu, 2 Mar 2017 15:50:34 +0900, Michael Paquier <michael.paqu...@gmail.com> 
wrote in <cab7npqrbr3gmfydedcahzukfkrgbltltmvenomprvwreszk...@mail.gmail.com>
> On Tue, Feb 21, 2017 at 9:53 AM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
> > On Mon, Feb 20, 2017 at 9:41 PM, Aleksander Alekseev
> > <a.aleks...@postgrespro.ru> wrote:
> >>> Speaking about flaws, it looks like there is a memory leak in
> >>> array_to_utf procedure - result is allocated twice.
> >
> > Pushed a fix for this one on my branch.
> >
> >> And a few more things I've noticed after a closer look:
> >>
> >> * build_client_first_message does not free `state->client_nonce` if
> >>   second malloc (for `buf`) fails
> >> * same for `state->client_first_message_bare`
> >> * ... and most other procedures declared in fe-auth-scram.c file
> >>   (see malloc and strdup calls)
> >
> > You are visibly missing pg_fe_scram_free().
> >
> >> * scram_Normalize doesn't check malloc return value
> >
> > Yes, I am aware of this one. This makes the interface utterly ugly
> > though because an error log message needs to be handled across many
> > API layers. We could just assume anything returning NULL is equivalent
> > to an OOM and nothing else though.
> Attached is a new patch set. I have combined SASLprep with the rest
> and fixed some conflicts. At the same time when going through NFKC
> this morning I have noticed that the implementation was doing the
> canonical decomposition and reordered the characters using the
> combining classes, but the string recomposition was still missing.
> This is addressed in this patch set, and well as on my dev tree:
> https://github.com/michaelpq/postgres/tree/scram

I looked into this and have some comments. Sorry for the random

Composition version should be written some where.

Perhaps one of the most important things is that the code exactly
reflects the TR. pg_utf8_check_string returns true for ASCII
strings but the TR says that

| Text containing only ASCII characters (U+0000 to U+007F) is left
| unaffected by all of the normalization forms. This is
| particularly important for programming languages

And running SASLprepare for apparent ASCII string (which is the
most case) is a waste of CPU cycles.

>From the point of view of reflectivity(please someone teach me an
appropreate wording for this..), basically the code had better to
be a copy of the reference code as long as no performance
degradation occurs. Hangul part of get_decomposed_size(and
decompose_code, recompose_code) uses different naming from the
TR. hindex should be sindex and t should be tindex. Magic numbers
should have names in the TR. 

* A bit later, I noticed that these are copies of charlint. If so
  I want a description about that.)

The following comment is equivalent to "reordering in canonical
order". But the definition of "decomposition" includes this
step. (I'm not sure if it needs rewriting, since I acutually
could understand that.)

> /*
>  * Now that the decomposition is done, apply the combining class
>  * for each multibyte character.
>  */

utf_sasl_prepare does canonical ordering in a bit different way
than the TR. Totally it should make a sequence of characters
starts with combining class = 0 and in the order of combining
class. The code does stable bubble sort within each combining
character and it seems to work as the same way. (In short, no
probelm found here.)

>    * make the allocation of the recomposed string based on that assumption.
>    */
>   recomp_chars = (pg_wchar *) malloc(decomp_size * sizeof(int));
>   lastClass = -1;     /* this eliminates a special check */

utf_sasl_prepare uses variable names with two naming
conventions. Is there any reason for the two?

> starterCh = recomp_chars[0] = decomp_chars[0];

starterCh reads as "starter channel" why not "starter_char"?

Other than the magic numbers, I don't think that the following is
not a good expression.

>   else if (start >= 0xAC00 && start < 0xD7A4 &&
>            !((start - 0xAC00) % 28) &&
>            code >= 0x11A8 && code < 0x11C3)

"!((start - 0xAC00) % 28)" is a similar of !strcmp(a, b) and it
is confusing. It would be better to be "((start - 0xAC00) % 28) == 0".

The last sub-condition "code >= 0x11A8 && code < 0x11C3"
corresnponds to "(0 <= TIndex && TIndex <= TCount)". TIndex here
is (code - 0x11a7) and TCount = 28 so this two are not identical.

Totally the condition should be like the following.

>   else if (start >= 0xAC00 && start < 0xD7A4 &&
>            ((start - 0xAC00) % 28) == 0 &&
>            code >= 0x11A7 && code <= 0x11C3)

The more preferable form is 

>   else if (start >= SBASE && start < (SBASE + SCOUNT) &&
>            ((start - SBASE) % TCOUNT) == 0 &&
>            code >= TBASE && code <= (TBASE + TCOUNT))

"code <= (TBASE + TCOUNT)" somewhat smells. Then I found the
original code for the current implementation in charlint and it
seems correct to me. Some description about the difference
between them is needed.

In use_sasl_prepare, the recompose part i sthe copy of charlint
but it seems a bit inefficient. It calls recompose_code
unconditionally but it is required only for the case of
"lastClass < chClass".

Something like this. (This still calls recompose_code for the
case that ch is the same position with starterChar so there still
be room for more improvement.)

>   if (lastClass < chClass &&
>       recompose_code(starterCh, ch, &composite))
>       recomp_chars[starterPos] = composite;
>       starterCh = composite;
>   }
>   else if (chClass == 0)

If I read the TR correctly, "6 Composition Exclusion Table" says
that there some characters not to be composed. But I don't find
the corresponding code. (or comments)


Kyotaro Horiguchi
NTT Open Source Software Center

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to