Ok, about U+1D160: I knew about NFC, NFD, etc., but I had never bothered to look up how they are actually implemented.
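For reference, U+1D160 is MUSICAL SYMBOL EIGHTH NOTE. A quick way to look at its decomposition (just a sketch, assuming the unicode.normalize vocabulary and its nfd word; 119136 is 0x1D160 written in decimal):

    USING: prettyprint sequences strings unicode.normalize ;

    119136 1string nfd length .   ! 3  (U+1D158 U+1D165 U+1D16E)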
It seems to be "easier" to decompose than to compose, so if you get data from different sources (Chrome, IE, etc.) you might be better off just NFD'ing everything when trying to match strings (see the sketch at the end of this message).

I checked this: http://www.icu-project.org/docs/papers/optimized_unicode_composition_and_decomposition.html and it seems that composition is implemented by building a composition table out of the decomposition table you linked before (UnicodeData.txt), and then iteratively composing back each pair of character + combining character.

From this, I thought that for each and every codepoint cp (or rather, for the one-codepoint string containing it),

    cp dup length swap nfc length >=

would hold true, i.e. that NFC never makes a string longer. Apparently that's not the case: http://stackoverflow.com/questions/17897534/can-unicode-nfc-normalization-increase-the-length-of-a-string

Currently I have no idea why that is, but apparently U+1D160 is exactly one of the worst cases: the normalized composed form of that single codepoint is 3 codepoints. So that's no bug, and parsing UnicodeData.txt yourself to "correct" it would probably result in some incompatibility with standard Unicode implementations (though I can't quite point out which requirement that would violate).
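In Factor terms the worst case looks like this (a minimal sketch, assuming the unicode.normalize vocabulary and its nfc word; 119136 is 0x1D160 written in decimal):

    USING: kernel prettyprint sequences strings unicode.normalize ;

    119136 1string        ! one-codepoint string containing U+1D160
    dup length .          ! 1
    nfc length .          ! 3  (NFC leaves it decomposed: U+1D158 U+1D165 U+1D16E)

And if the "NFD everything before matching" approach is enough for your use case, it can be as small as something like this (nfd= is just a made-up name, and it only handles canonical equivalence, not compatibility or case differences):

    USING: kernel unicode.normalize ;

    ! Compare two strings after normalizing both to NFD, so that
    ! precomposed and decomposed inputs are treated as equal.
    : nfd= ( str1 str2 -- ? ) [ nfd ] bi@ = ;

e.g. a precomposed U+00E9 ("é") and "e" followed by U+0301 then compare equal under nfd=, while plain = on the raw strings says they differ.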