On 06/21/2007 09:42 PM, Tom Allison wrote:
OK, I sorted out what the deal is with charsets, Encode, utf8 and other
goodies.
Now I have something, and I'm just not sure exactly how it is supposed to
operate.
I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.
After I decode_base64 the string and then decode('iso-2022-jp', $text)
the result, I can print out something that looks exactly like Japanese
characters.
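For what it's worth, a minimal sketch of those two steps (using the exact encoded-word from the message above; Encode's MIME-Header codec does the unwrapping and charset conversion in one call, so you don't need to call decode_base64 yourself):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The encoded-word header quoted above
my $raw = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Encode's MIME-Header codec unwraps the =?charset?B?...?= framing,
# base64-decodes the payload, and converts from iso-2022-jp into Perl's
# internal character representation in one step:
my $subject = decode('MIME-Header', $raw);

# $subject is now a character string you can print or match against.
```

The base64 payload happens to start with the ASCII bytes "FW: " before switching into the iso-2022-jp escape sequence for the Japanese part, which is why a forwarded-mail prefix survives the round trip intact.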
But you can't match /(\w+)/ against it. It's apparently one "word" without
spaces in it.
Um... I don't know Japanese. But I guess this string of spaghetti (to
me) is actually in a language where what a Unicode terminal shows as one
character is actually one 'word' according to the Perl definition of a
word...
In English, this would pick apart words in a way that is simple for me
and many on this list to understand.
I guess my question is: for CJK languages, should I expect a regex like
/\w+/ to pick up entire runs of text instead of the discrete words it
would find in Latin-based languages?
Sadly, I must admit that I'm operating way outside of my knowledge
domain on this one, but I'll try to give an answer.
Yes, be prepared for the fact that not all languages support the
concept of spaces between words. I don't know anything about Japanese,
but I do vaguely remember from high school that Chinese texts often have
no spaces between words, and the reader's knowledge of the language
allows him or her to infer the word separations.
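To see the behavior you described concretely, here's a small sketch (the Japanese string below is just an arbitrary example, not your actual subject line):

```perl
use strict;
use warnings;
use utf8;   # the string literal below is UTF-8 in the source file

# An arbitrary run of kanji and kana with no spaces between "words":
my $ja = "日本語のテキスト";

my @words = $ja =~ /(\w+)/g;

# Under Perl's Unicode semantics, ideographs and kana are all \w
# characters, so the entire run comes back as a single match:
# scalar(@words) == 1
```

So \w+ is doing exactly what it's documented to do; it's the "words are separated by non-word characters" assumption that doesn't hold for unsegmented CJK text.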
However, even without knowing Japanese, we might be able to help you
find acceptable solutions. What is your program supposed to do?
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/