splitting CJK text into "words"

Martin Wierschin Wed, 26 Sep 2012 14:16:16 -0700

Hello everyone,

I'm trying to split CJK text using the kind of word boundaries detected by 
-[NSAttributedString doubleClickAtIndex:]. That method does the job correctly, 
but only if the Word Break mode in the system preferences is set to Japanese. 
I need this kind of word splitting regardless of the user's system 
preferences.


It was my understanding that I could use CFStringTokenizer for this task, but 
it doesn't seem to work. Here is test code that produces the wrong result:

> NSString* str = @"\u4E2D\u79CB\u5FEB\u5230\u4E86"; // 中秋快到了
> CFRange strRange = CFRangeMake(0, [str length]);
>
> CFStringRef cjkIdent =
>     CFLocaleCreateCanonicalLocaleIdentifierFromString(NULL, CFSTR("jp"));
> CFLocaleRef cjkLoc = CFLocaleCreate(NULL, cjkIdent);
> CFStringTokenizerRef cjkTokenizer = CFStringTokenizerCreate(NULL,
>     (CFStringRef)str, strRange, kCFStringTokenizerUnitWordBoundary, cjkLoc);
>
> CFStringTokenizerTokenType tokenType =
>     CFStringTokenizerAdvanceToNextToken(cjkTokenizer);
> CFRange wordRange = CFStringTokenizerGetCurrentTokenRange(cjkTokenizer);

This code sets wordRange to (0,2), not (0,5) as I'd like.

I've tried a variety of locale identifiers (e.g. "zh", "jp_JP") but no joy. 
Am I missing something?

Thanks for any help,
~Martin