On 06/21/2007 09:42 PM, Tom Allison wrote:
OK, I sorted out what the deal is with charsets, Encode, utf8 and other goodies.

Now I have something where I'm just not sure exactly how it is supposed to operate.

I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.

After I decode_base64 them and decode($text,'iso-2022-jp','utf8') them I can print out something that looks exactly like Japanese characters.
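For reference, the decoding step described above can be sketched roughly as follows. This is only a minimal sketch: the helper name decode_b_word is made up for illustration, it handles a single B-encoded word and nothing else, and it assumes the usual Encode::decode(ENCODING, OCTETS) argument order, which returns a Perl character string rather than UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

my $header = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Hypothetical helper: split one "=?charset?B?payload?=" word into its
# charset and Base64 payload, then turn the payload into Perl characters.
sub decode_b_word {
    my ($word) = @_;
    my ($charset, $payload) = $word =~ /^=\?([^?]+)\?[Bb]\?([^?]*)\?=$/
        or return;
    my $octets = decode_base64($payload);   # raw iso-2022-jp bytes
    return decode($charset, $octets);       # character string
}

my $text = decode_b_word($header);

# Encode on the way out so the terminal receives UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
print length($text), " characters\n";
```

Encode also ships a 'MIME-Header' encoding, so decode('MIME-Header', $header) should do both steps in one call and cope with Q-encoded words as well.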

But you can't match /(\w+)/ on them. It's apparently one "word" without spaces in it. Um... I don't know Japanese. But I guess this string of spaghetti (to me) is actually a language where one character as represented in a Unicode terminal is actually one 'word' according to the Perl definition of a word...

In English, this would pick apart words in a way that is simple for me and many on this list to understand.

I guess my question is, for CJK languages, should I expect a regex like \w+ to pick up entire strings of text, rather than the discrete words it picks up in Latin-based languages?


Sadly, I must admit that I'm operating way outside of my knowledge domain on this one, but I'll try to give an answer.

Yes, be prepared for the fact that not all languages put spaces between words. I don't know anything about Japanese, but I do vaguely remember from high school that Chinese texts often have no spaces between words, and the reader's knowledge of the language allows him or her to infer the word separations.
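To illustrate, here is a rough sketch of how \w+ behaves on such text, along with one crude workaround: splitting on changes of script using Unicode properties. The sample string and the script_runs helper are made up for this example, and a change of script is only a heuristic. It is not real word segmentation, which generally needs a dictionary-based morphological analyzer.

```perl
use strict;
use warnings;
use utf8;   # this source file contains literal Japanese text

# Hypothetical helper: break text into runs of a single script.
# A script change often, but not always, falls on a word boundary.
sub script_runs {
    my ($text) = @_;
    my @runs;
    while ($text =~ /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|[A-Za-z0-9]+)/g) {
        push @runs, $1;
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';

my $sample = "FW: 日本語のテキストです";   # made-up sample, not the original mail

# \w matches kanji and kana alike, so the Japanese part is one long "word".
my @words = $sample =~ /(\w+)/g;
print scalar(@words), " matches from \\w+\n";

# Splitting by script gives smaller pieces.
print join(' | ', script_runs($sample)), "\n";
```

On this sample, \w+ finds only two matches ("FW" and the entire Japanese run), while script_runs returns five pieces. Whether those pieces are useful depends on what the program needs the words for.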

However, even without knowing Japanese, we might be able to help you find acceptable solutions. What is your program supposed to do?

