> On 2 Apr 2017, at 10:59, Aki Inoue <[email protected]> wrote:
> 
> 
>> On Apr 1, 2017, at 4:57 PM, Gerriet M. Denkmann <[email protected]> wrote:
>> 
>> 
>>> On 2 Apr 2017, at 06:33, Jens Alfke <[email protected]> wrote:
>>> 
>>> 
>>>> On Apr 1, 2017, at 11:58 AM, Gerriet M. Denkmann <[email protected]> wrote:
>>>> 
>>>> I think the examples above show that NSURL does indeed do something
>>>> about normalising Unicode strings.
>>> 
>>> That makes sense; I’d expect that one of the RFCs covering URLs describes 
>>> normalization. Otherwise constructing URLs (for a REST API, say) could 
>>> become quite ambiguous because you wouldn’t know which way to encode 
>>> various Unicode characters.
>>> 
>>>> But my point is that NSURL gets the normalisation wrong in this case; or 
>>>> at least that it is not very consistent in normalising strings.
>>> 
>>> Yes, it does seem wrong that you can have two filenames that are treated as 
>>> distinct by the filesystem, but whose URL.path properties produce identical 
>>> NSStrings.
>> 
>> Sorry, my explanation was not quite clear: these two filenames look 
>> absolutely identical, but as a sequence of Unicode code points, they are not 
>> (tone-mark and vowel are in different order).
>> 
>> What puzzles me is that consonant + THAI CHARACTER MAI EK + THAI CHARACTER
>> SARA UU gets normalised by NSURL to consonant + THAI CHARACTER SARA UU +
>> THAI CHARACTER MAI EK (note the different order), whereas consonant + THAI
>> CHARACTER MAI EK + THAI CHARACTER SARA II is left unchanged.
> Gerriet,
> 
> This is standard Unicode normalization behavior. Each Unicode character
> is assigned a Canonical_Combining_Class (ccc), an integer value that
> determines the canonical ordering of combining marks.
> 
> The Canonical_Combining_Class of THAI CHARACTER SARA UU is 103, and that of
> THAI CHARACTER MAI EK is 107. So MAI EK always comes after SARA UU in
> canonical order.
> 
> On the other hand, THAI CHARACTER SARA II has the value 0 (Not_Reordered),
> which marks the start of a new reordering segment. That’s why the character
> is not reordered with respect to other Thai combining characters.
> 
> Aki

Thanks a lot for this explanation.

I just read about Canonical_Combining_Class in
<http://unicode.org/reports/tr44/#Validation_of_CCC>.
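
To check the values for myself, here is a minimal sketch in Python. The language and its standard unicodedata module are my own choice for illustration; nothing here is NSURL's actual code path. I am also assuming NFD as the normalisation form, and KO KAI is just an arbitrary consonant picked for the example.

```python
import unicodedata

KO_KAI = "\u0E01"   # THAI CHARACTER KO KAI, an arbitrary consonant (assumption)
MAI_EK = "\u0E48"   # THAI CHARACTER MAI EK, a tone mark
SARA_UU = "\u0E39"  # THAI CHARACTER SARA UU, a bottom vowel
SARA_II = "\u0E35"  # THAI CHARACTER SARA II, a top vowel


def ccc(ch):
    """Return the Canonical_Combining_Class of a single character."""
    return unicodedata.combining(ch)


def nfd(s):
    """Return the canonical decomposition of s, which applies canonical ordering."""
    return unicodedata.normalize("NFD", s)


def code_points(s):
    """Format a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for ch in (MAI_EK, SARA_UU, SARA_II):
    print(unicodedata.name(ch), ccc(ch))  # classes 107, 103 and 0 respectively

# MAI EK (107) followed by SARA UU (103): the two marks are swapped.
print(code_points(nfd(KO_KAI + MAI_EK + SARA_UU)))  # U+0E01 U+0E39 U+0E48

# MAI EK (107) followed by SARA II (0): class 0 blocks reordering, so no change.
print(code_points(nfd(KO_KAI + MAI_EK + SARA_II)))  # U+0E01 U+0E48 U+0E35
```

This matches what I saw with NSURL: only marks with a non-zero class take part in the reordering, and a character with class 0 acts as a barrier between them.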

What I did not find was an explanation of why all Thai top vowels (plus THAI
CHARACTER MAI HAN-AKAT) have Canonical_Combining_Class 0 (Not_Reordered),
whereas the bottom vowels have 103.

Another strange thing: the tone marks have 107, but THAI CHARACTER THANTHAKHAT
has 0. (THANTHAKHAT sometimes occurs together with ิ, e.g. เกียรติ์, or ุ, e.g. บงสุ์.)
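
For an overview, the following Python sketch groups the Thai nonspacing marks by their Canonical_Combining_Class. The helper name thai_marks_by_class is hypothetical, and the values come from whatever Unicode version the Python build ships with, so treat it as a quick check rather than an authoritative table.

```python
import unicodedata


def thai_marks_by_class(first=0x0E00, last=0x0E7F):
    """Group the nonspacing marks (General_Category Mn) of the Thai block
    by their Canonical_Combining_Class."""
    groups = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Unassigned code points have category "Cn" and are skipped here.
        if unicodedata.category(ch) == "Mn":
            groups.setdefault(unicodedata.combining(ch), []).append(ch)
    return groups


groups = thai_marks_by_class()
for cls in sorted(groups):
    print(f"class {cls}:")
    for ch in groups[cls]:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

On my reading of the data, SARA U and SARA UU end up under 103, the four tone marks under 107, and the top vowels, MAI HAN-AKAT and THANTHAKHAT all under 0.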

If you have any links to an explanation of these (to me) rather strange
decisions of the Unicode people, I would appreciate them very much.


Kind regards,

Gerriet.


_______________________________________________

Cocoa-dev mailing list ([email protected])
