On Thu, Mar 2, 2017 at 9:13 PM, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote: > I'm studying the normalization of Unicode so I apologize possible > stupidity in advance. > > I looked into this and have some comments. Sorry for the random > order.
Thanks, this needs a lot of background and I am glad that somebody is taking the time to look at what I am doing here. > ==== > Composition version should be written some where. Sure. > ==== > Perhaps one of the most important things is that the code exactly > reflects the TR. pg_utf8_check_string returns true for ASCII > strings but the TR says that > > | Text containing only ASCII characters (U+0000 to U+007F) is left > | unaffected by all of the normalization forms. This is > | particularly important for programming languages > > And running SASLprepare for apparent ASCII string (which is the > most case) is a waste of CPU cycles. Yeah, that's true. We could just for example check in pg_utf8_check_string() if the length gathered matches strlen(source) as only ASCII are 1-byte long. > ==== > From the point of view of reflectivity (please someone teach me an > appropreate wording for this..), basically the code had better to > be a copy of the reference code as long as no performance > degradation occurs. Hangul part of get_decomposed_size(and > decompose_code, recompose_code) uses different naming from the > TR. hindex should be sindex and t should be tindex. Magic numbers > should have names in the TR. > > * A bit later, I noticed that these are copies of charlint. If so > I want a description about that.) Yeah, their stuff works quite nicely. > ==== > The following comment is equivalent to "reordering in canonical > order". But the definition of "decomposition" includes this > step. (I'm not sure if it needs rewriting, since I acutually > could understand that.) > >> /* >> * Now that the decomposition is done, apply the combining class >> * for each multibyte character. >> */ I have reworked a bit this one: /* - * Now that the decomposition is done, apply the combining class - * for each multibyte character. + * Now end the decomposition by applying the combining class for + * each multibyte character. */ > ==== >> * make the allocation of the recomposed string based on that assumption. >> */ >> recomp_chars = (pg_wchar *) malloc(decomp_size * sizeof(int)); >> lastClass = -1; /* this eliminates a special check */ > > utf_sasl_prepare uses variable names with two naming > conventions. Is there any reason for the two? OK, did some adjustments here. > ==== >> starterCh = recomp_chars[0] = decomp_chars[0]; > > starterCh reads as "starter channel" why not "starter_char"? Starter character of the current set, which is a character with a combining class of 0. > ==== >> else if (start >= SBASE && start < (SBASE + SCOUNT) && >> ((start - SBASE) % TCOUNT) == 0 && >> code >= TBASE && code <= (TBASE + TCOUNT)) > > "code <= (TBASE + TCOUNT)" somewhat smells. Then I found the > original code for the current implementation in charlint and it > seems correct to me. Some description about the difference > between them is needed. Right. I have updated all those things to use constants instead of hardcoded values. > ==== > In use_sasl_prepare, the recompose part is the copy of charlint > but it seems a bit inefficient. It calls recompose_code > unconditionally but it is required only for the case of > "lastClass < chClass". > > Something like this. (This still calls recompose_code for the > case that ch is the same position with starterChar so there still > be room for more improvement.) Agreed. > ==== > If I read the TR correctly, "6 Composition Exclusion Table" says > that there some characters not to be composed. But I don't find > the corresponding code. (or comments) Ah, right! There are 100 characters that enter in this category, and all of them have a combining class of 0, so it is as simple as removing them from the tables generated. I am attaching 0009 and 0010 that address those problems (pushed on github as well) that can be applied on top of the latest set. -- Michael
0009-Set-of-fixes-for-SASLprep.patch
Description: Binary data
0010-Consider-characters-excluded-from-composition-in-con.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers