On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:

I'm most likely going to have to support many text encodings. Say I'm writing a document in Japanese (Mac OS): will I have to convert that to UTF-8 before the methods of something like RegexKit would work? Any caveats you know of that I need to be aware of? I'm learning by doing.

It's not the encoding that's an issue, at least not at the point you're running a regex. Presumably you had to deal with encodings just to get the data into an NSString in the first place.

The limitation of PCRE is in its handling of character classes. IIRC, PCRE doesn't consider any character above 0x7F to be alphanumeric, so regex character types like "\w" won't match non-ASCII letters. Worse, it detects word boundaries ("\b") by looking for a transition between word and non-word characters. Here the problem isn't just that it doesn't know about non-ASCII word characters; it's that some languages have more complex rules for detecting word breaks. In Japanese and Thai, for example, words are often written without spaces between them, and you have to use linguistic rules to determine where the breaks go. ICU knows how to do this.
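
To make the character-class limitation concrete, here is a small sketch in Python rather than PCRE itself. Using Python's re module is purely an assumption of convenience: its re.ASCII flag restricts "\w" to ASCII characters, which resembles the behavior described above, while its default mode for str patterns is Unicode-aware. It illustrates the difference only and says nothing about what a particular PCRE build does.

```python
import re

# Sample text: "café", the kanji for "Tokyo", and "naïve".
# Escapes are used so the code points are unambiguous.
text = "caf\u00e9 \u6771\u4eac na\u00efve"

# ASCII-only word characters (stand-in for the PCRE behavior described above).
ascii_tokens = re.findall(r"\w+", text, re.ASCII)

# Unicode-aware word characters (Python's default for str patterns).
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)    # accented words are split, the kanji are skipped
print(unicode_tokens)  # each word comes back whole
```

With ASCII-only classes the accented letters act as separators and the kanji never match at all, so any logic built on "\w" silently loses non-ASCII text.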

The problem I ran into with PCRE came up while implementing a typical filter field (the one in Safari RSS) that needed to match word prefixes. The search regex began with "\b" to match the word break, but it didn't work correctly on most kanji text.
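
The prefix-filter failure can be sketched the same way. This is not the Safari RSS code; prefix_filter is a hypothetical helper, and Python's re module with re.ASCII is again only a stand-in for an engine whose "\b" knows about ASCII word characters alone.

```python
import re

def prefix_filter(query, text, ascii_only=False):
    """Return True if `query` matches at the start of a word in `text`.

    Hypothetical helper for illustration: it prepends a word-boundary
    assertion to the escaped query, as a simple filter field might.
    """
    flags = re.ASCII if ascii_only else 0
    return re.search(r"\b" + re.escape(query), text, flags) is not None

tokyo = "\u6771\u4eac"            # "Tokyo", two kanji
tokyo_metro = "\u6771\u4eac\u90fd"  # "Tokyo Metropolis", three kanji, no spaces

# Space-separated ASCII text behaves as expected.
print(prefix_filter("Saf", "Apple Safari", ascii_only=True))
print(prefix_filter("fari", "Apple Safari", ascii_only=True))

# ASCII-only \b: the first kanji is a non-word character, so there is
# no boundary even at the very start of the string.
print(prefix_filter("\u6771", tokyo, ascii_only=True))

# Unicode-aware \b finds the start of the string...
print(prefix_filter("\u6771", tokyo))

# ...but still sees the unspaced run of kanji as one word, so a word
# starting in the middle of it is not found without linguistic rules.
print(prefix_filter("\u4eac", tokyo_metro))
```

The last call shows why Unicode-aware character classes alone are not enough: finding word starts inside unspaced text needs a real word-break algorithm such as the one ICU provides.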

(Now, this was a few years ago. It's possible that PCRE's Unicode support has been improved since. If this is important to you, go check the docs.)

—Jens
