On Thu, Mar 2, 2017 at 9:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I'm studying the normalization of Unicode so I apologize possible
> stupidity in advance.
> I looked into this and have some comments. Sorry for the random
> order.

Thanks, this needs a lot of background and I am glad that somebody is
taking the time to look at what I am doing here.

> ====
> Composition version should be written some where.


> ====
> Perhaps one of the most important things is that the code exactly
> reflects the TR. pg_utf8_check_string returns true for ASCII
> strings but the TR says that
> | Text containing only ASCII characters (U+0000 to U+007F) is left
> | unaffected by all of the normalization forms. This is
> | particularly important for programming languages
> And running SASLprepare for apparent ASCII string (which is the
> most case) is a waste of CPU cycles.

Yeah, that's true. We could just for example check in
pg_utf8_check_string() if the length gathered matches strlen(source)
as only ASCII are 1-byte long.

> ====
> From the point of view of reflectivity (please someone teach me an
> appropreate wording for this..), basically the code had better to
> be a copy of the reference code as long as no performance
> degradation occurs. Hangul part of get_decomposed_size(and
> decompose_code, recompose_code) uses different naming from the
> TR. hindex should be sindex and t should be tindex. Magic numbers
> should have names in the TR.
> * A bit later, I noticed that these are copies of charlint. If so
>   I want a description about that.)

Yeah, their stuff works quite nicely.

> ====
> The following comment is equivalent to "reordering in canonical
> order". But the definition of "decomposition" includes this
> step. (I'm not sure if it needs rewriting, since I acutually
> could understand that.)
>> /*
>>  * Now that the decomposition is done, apply the combining class
>>  * for each multibyte character.
>>  */

I have reworked a bit this one:
-    * Now that the decomposition is done, apply the combining class
-    * for each multibyte character.
+    * Now end the decomposition by applying the combining class for
+    * each multibyte character.

> ====
>>    * make the allocation of the recomposed string based on that assumption.
>>    */
>>   recomp_chars = (pg_wchar *) malloc(decomp_size * sizeof(int));
>>   lastClass = -1;     /* this eliminates a special check */
> utf_sasl_prepare uses variable names with two naming
> conventions. Is there any reason for the two?

OK, did some adjustments here.

> ====
>> starterCh = recomp_chars[0] = decomp_chars[0];
> starterCh reads as "starter channel" why not "starter_char"?

Starter character of the current set, which is a character with a
combining class of 0.

> ====
>>   else if (start >= SBASE && start < (SBASE + SCOUNT) &&
>>            ((start - SBASE) % TCOUNT) == 0 &&
>>            code >= TBASE && code <= (TBASE + TCOUNT))
> "code <= (TBASE + TCOUNT)" somewhat smells. Then I found the
> original code for the current implementation in charlint and it
> seems correct to me. Some description about the difference
> between them is needed.

Right. I have updated all those things to use constants instead of
hardcoded values.

> ====
> In use_sasl_prepare, the recompose part is the copy of charlint
> but it seems a bit inefficient. It calls recompose_code
> unconditionally but it is required only for the case of
> "lastClass < chClass".
> Something like this. (This still calls recompose_code for the
> case that ch is the same position with starterChar so there still
> be room for more improvement.)


> ====
> If I read the TR correctly, "6 Composition Exclusion Table" says
> that there some characters not to be composed. But I don't find
> the corresponding code. (or comments)

Ah, right! There are 100 characters that enter in this category, and
all of them have a combining class of 0, so it is as simple as
removing them from the tables generated.

I am attaching 0009 and 0010 that address those problems (pushed on
github as well) that can be applied on top of the latest set.

Attachment: 0009-Set-of-fixes-for-SASLprep.patch
Description: Binary data

Attachment: 0010-Consider-characters-excluded-from-composition-in-con.patch
Description: Binary data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to