Ok, about U+1D160

I knew about NFC, NFD, etc. but I never bothered to look up how they
were actually implemented

it seems to be "easier" to decompose than to compose, so if you get
data from different sources (Chrome, IE, etc.) you might be better off
just running everything through NFD before trying to match strings
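
To make the "NFD everything before matching" idea concrete, here is a
minimal sketch in Python (not Factor), using the standard unicodedata
module. The helper name same_text is made up for this example.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after decomposing both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"     # e-acute as a single precomposed codepoint
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # raw comparison of the two spellings
print(same_text(composed, decomposed))  # comparison after NFD
```

The raw comparison fails because the codepoint sequences differ, while
the NFD comparison treats both spellings as the same text.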

I checked this:
http://www.icu-project.org/docs/papers/optimized_unicode_composition_and_decomposition.html

and it seems that composition is done by building a compositions
table out of the decomposition data in the file you linked before
(UnicodeData.txt)... and then iteratively composing back each pair
of character + combining character
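
A rough sketch of that idea in Python, assuming the decomposition data
exposed by unicodedata.decomposition() stands in for UnicodeData.txt.
This is not ICU's implementation: it deliberately ignores composition
exclusions, Hangul, and canonical ordering, and the names
build_composition_table and naive_compose are invented here.

```python
import sys
import unicodedata

def build_composition_table() -> dict:
    """Invert the canonical two-codepoint decompositions (hypothetical helper)."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no decomposition and compatibility
        # decompositions, which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [int(p, 16) for p in decomp.split()]
        if len(parts) == 2:
            table[(parts[0], parts[1])] = cp
    return table

COMPOSITIONS = build_composition_table()

def naive_compose(text: str) -> str:
    """Greedily merge each (previous, current) pair found in the table."""
    out = []
    for ch in text:
        if out:
            merged = COMPOSITIONS.get((ord(out[-1]), ord(ch)))
            if merged is not None:
                out[-1] = chr(merged)
                continue
        out.append(ch)
    return "".join(out)

print(naive_compose("e\u0301") == "\u00e9")
print(len(naive_compose("\U0001D158\U0001D165\U0001D16E")))
```

With this naive table the three musical-symbol codepoints collapse
back into a single one, which is exactly what a conforming NFC does
not do for U+1D160.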

From this, I thought that for each and every codepoint cp

cp dup length swap nfc length >=

would hold true, but apparently that's not the case:

http://stackoverflow.com/questions/17897534/can-unicode-nfc-normalization-increase-the-length-of-a-string

currently I have no idea why that is, but apparently U+1D160 is
exactly one of the worst cases (its NFC form is 3 codepoints long)...
so that's not a bug, and parsing UnicodeData.txt yourself to "correct"
it would probably result in some incompatibility with standard Unicode
implementations (but I can't quite point out which requirement this
would violate)
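
For what it's worth, the behaviour is easy to reproduce outside Factor.
The snippet below is a small check in Python using the standard
unicodedata module. One thing worth looking at: U+1D160 is listed in
CompositionExclusions.txt, which would explain why NFC decomposes it
and then never recomposes it.

```python
import unicodedata

note = "\U0001D160"  # MUSICAL SYMBOL EIGHTH NOTE

nfc = unicodedata.normalize("NFC", note)
nfd = unicodedata.normalize("NFD", note)

# Show the codepoints of the NFC form and compare lengths.
print([f"U+{ord(c):04X}" for c in nfc])
print(len(note), len(nfc), len(nfd))
```

Here the NFC and NFD forms are the same three codepoints (U+1D158,
U+1D165, U+1D16E), so the ">=" property from above fails for this one.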

_______________________________________________
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk