Re: AM/PM letter UNICODE issues

Quincey Morris Mon, 18 Oct 2010 11:51:46 -0700

On Oct 18, 2010, at 10:19, Alex Kac wrote:

> What we are trying to do:
> Shorten the AM/PM to just the first character in Western Languages so that a 
> time is shown as "1:30a". 
> 
>       NSDateFormatter* formatter = [[NSDateFormatter alloc] init];
>       NSString* am = [[[formatter AMSymbol] substringToIndex:1] lowercaseString];
>       NSString* pm = [[[formatter PMSymbol] substringToIndex:1] lowercaseString];
> 
> 
> This works in Western languages just fine. However, in languages like Korean 
> it does not work, seemingly giving a random character. From reading on this 
> list over time, I believe it's because I'm just getting one part of a 
> multi-part character (I'm no good with Unicode terms, sorry). 
> 
> My guess is I need to use rangeOfComposedCharacterSequenceAtIndex and then 
> get the range and use a substring with that range. But I'm not sure since my 
> knowledge here is pretty limited.

This description seems pretty good (and short):

http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

Basically, there are several nested levels of complexity:

1. UTF-16 code units (the 16-bit values that are indexed by NSString's
'...AtIndex:' methods)

2. Unicode code points (each stored as a single UTF-16 unit or as a surrogate
pair of UTF-16 units)

3. Composed characters (such as accented characters) made up of a base code
point followed by one or more combining code points

4. Grapheme clusters, which are sequences of Unicode code points representing
things that are written as a single unit (in some sense, depending on the
language)

5. Related character sequences (I don't know if there's an official name for this)
such as German 'ß' and 'SS' that figure into algorithms for sorting and case
changing.

According to the above-linked page, #3 and #4 aren't really different.
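
To make levels 1 through 3 concrete, here is a small sketch (my own illustration, not taken from the linked page). It assumes a reasonably recent clang with ARC, and the two sample strings are just ones I picked. Both are two UTF-16 units long, yet each is reported as a single composed character sequence:

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points,
        // two UTF-16 units, one composed character.
        const unichar accented[] = { 0x0065, 0x0301 };
        NSString *e = [NSString stringWithCharacters:accented length:2];

        // U+1D11E MUSICAL SYMBOL G CLEF: one code point stored as a
        // surrogate pair, so again two UTF-16 units.
        const unichar clef[] = { 0xD834, 0xDD1E };
        NSString *g = [NSString stringWithCharacters:clef length:2];

        for (NSString *s in @[e, g]) {
            // -length counts UTF-16 units; the range covers the whole
            // composed character sequence that contains index 0.
            NSRange r = [s rangeOfComposedCharacterSequenceAtIndex:0];
            NSLog(@"length=%lu first sequence=%@",
                  (unsigned long)[s length], NSStringFromRange(r));
        }
    }
    return 0;
}
```

In both cases 'substringToIndex:1' would cut the sequence in half, which is the kind of breakage described in the question.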

Also according to the above-linked page,
'rangeOfComposedCharacterSequenceAtIndex:' does sound like the method to use.
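
A minimal sketch of how that might look for the AM/PM case, assuming ARC. The helper name 'FirstComposedCharacter' is hypothetical (something I made up for illustration), and the empty-string check is there because 'rangeOfComposedCharacterSequenceAtIndex:' raises an exception when the index is out of range:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first composed character sequence of
// `symbol`, lowercased, or an empty string if `symbol` is nil or empty.
static NSString *FirstComposedCharacter(NSString *symbol) {
    if ([symbol length] == 0) {
        return @"";
    }
    // Ask for the full range of the sequence that starts at index 0,
    // instead of assuming it is exactly one UTF-16 unit long.
    NSRange range = [symbol rangeOfComposedCharacterSequenceAtIndex:0];
    return [[symbol substringWithRange:range] lowercaseString];
}

int main(void) {
    @autoreleasepool {
        NSDateFormatter *formatter = [[NSDateFormatter alloc] init];
        NSString *am = FirstComposedCharacter([formatter AMSymbol]);
        NSString *pm = FirstComposedCharacter([formatter PMSymbol]);
        NSLog(@"am=%@ pm=%@", am, pm);
    }
    return 0;
}
```

This only guarantees that the result is a whole character; whether that character is a sensible abbreviation in a given locale is a separate question.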

It's not obvious that taking the first grapheme is going to be semantically
meaningful in every language (for example, if the English abbreviations
happened to be MA and MP, taking the first grapheme wouldn't help you -- the
assumption that the first character distinguishes the time range is not
necessarily valid across all languages), but at least it's not going to give
you an unrelated character.
