> On 2 Apr 2017, at 10:59, Aki Inoue <[email protected]> wrote:
>
>> On Apr 1, 2017, at 4:57 PM, Gerriet M. Denkmann <[email protected]> wrote:
>>
>>> On 2 Apr 2017, at 06:33, Jens Alfke <[email protected]> wrote:
>>>
>>>> On Apr 1, 2017, at 11:58 AM, Gerriet M. Denkmann <[email protected]> wrote:
>>>>
>>>> I think that the examples above show, that NSURL does indeed do something
>>>> about normalising Unicode strings.
>>>
>>> That makes sense; I’d expect that one of the RFCs covering URLs describes
>>> normalization. Otherwise constructing URLs (for a REST API, say) could
>>> become quite ambiguous because you wouldn’t know which way to encode
>>> various Unicode characters.
>>>
>>>> But my point is that NSURL gets the normalisation wrong in this case; or
>>>> at least that it is not very consistent in normalising strings.
>>>
>>> Yes, it does seem wrong that you can have two filenames that are treated as
>>> distinct by the filesystem, but whose URL.path properties produce identical
>>> NSStrings.
>>
>> Sorry, my explanation was not quite clear: these two filenames look
>> absolutely identical, but as a sequence of Unicode code points, they are not
>> (tone-mark and vowel are in different order).
>>
>> What puzzles me is that consonant + THAI CHARACTER MAI EK + THAI CHARACTER
>> SARA UU gets normalised by NSURL to: consonant + THAI CHARACTER SARA UU +
>> THAI CHARACTER MAI EK (note the different order), whereas consonant + THAI
>> CHARACTER MAI EK + THAI CHARACTER SARA II is left unchanged.
>
> Garret,
>
> This is the standard Unicode Normalization behavior. Each Unicode character
> is assigned the Unicode Combining Property, an integer value defining the
> canonical ordering of combining marks.
>
> The Unicode Combining Property for THAI CHARACTER SARA UU is 103, and THAI
> CHARACTER MAI EK 107. So, MAI EK always comes after SARA UU in the canonical
> order.
>
> On the other hand, THAI CHARACTER SARA II has the property value 0 which
> indicates the start of the reordering segment. That’s why the character is
> not reordered in respect to other Thai combining characters.
>
> Aki

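The reordering described above can be reproduced outside Cocoa. The following is only a minimal sketch in Python, assuming that the standard `unicodedata` module uses the same Unicode character data as NSURL (not verified here). The base consonant KO KAI and the helper name `code_points` are arbitrary choices for illustration, and NFD is used only because canonical reordering is part of every normalization form.

```python
import unicodedata

# Code points taken from the Thai block; KO KAI is an arbitrary base consonant.
KO_KAI = "\u0E01"       # THAI CHARACTER KO KAI
SARA_II = "\u0E35"      # THAI CHARACTER SARA II (top vowel)
SARA_UU = "\u0E39"      # THAI CHARACTER SARA UU (bottom vowel)
MAI_EK = "\u0E48"       # THAI CHARACTER MAI EK (tone mark)
THANTHAKHAT = "\u0E4C"  # THAI CHARACTER THANTHAKHAT


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


# unicodedata.combining() returns the canonical combining class of a character.
for ch in (SARA_II, SARA_UU, MAI_EK, THANTHAKHAT):
    print(f"{unicodedata.name(ch):30} ccc = {unicodedata.combining(ch)}")

# Tone mark (107) before bottom vowel (103): both are non-zero, so canonical
# ordering sorts them by class and the vowel moves in front of the tone mark.
reordered = unicodedata.normalize("NFD", KO_KAI + MAI_EK + SARA_UU)

# Tone mark (107) before top vowel (0): class 0 starts a new segment,
# so nothing is reordered across it.
unchanged = unicodedata.normalize("NFD", KO_KAI + MAI_EK + SARA_II)

print(code_points(reordered))
print(code_points(unchanged))
```

Comparing the two printed code point lists with the input order shows which sequence was reordered and which was left alone.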
Thanks a lot for this explanation.

I just read about Canonical_Combining_Class in
<http://unicode.org/reports/tr44/#Validation_of_CCC>.

What I did not find was an explanation of why all Thai top vowels (plus THAI
CHARACTER MAI HAN-AKAT) have Canonical_Combining_Class 0 (Not_Reordered),
whereas the bottom vowels have 103.

Another strange thing: the tone marks have 107, but THAI CHARACTER
THANTHAKHAT has 0. (It sometimes occurs together with ิ, e.g. เกียรติ์, or ุ,
e.g. บงสุ์.)

If you have any links to an explanation for these (to me) rather strange
decisions of the Unicode people, I would appreciate it very much.

Kind regards,

Gerriet.
