On 06/21/2007 09:42 PM, Tom Allison wrote:
OK, I sorted out what the deal is with charsets, Encode, utf8 and other
goodies.
Now I have something, and I'm just not sure exactly how it is supposed to
operate.
I have a string:
=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=
That is a MIME::Base64 encoded string of iso-2022-jp characters.
After I decode_base64 the string and then decode('iso-2022-jp', $text)
the result, I can print out something that looks exactly like Japanese
characters.
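For what it's worth, a minimal sketch of those two steps (using the exact encoded-word from the message above; Encode's MIME-Header codec does the unwrapping and charset conversion in one call, so you don't need to call decode_base64 yourself):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The encoded-word header quoted above
my $raw = '=?iso-2022-jp?B?Rlc6IBskQjxkJDckNSRHJE8kSiQvJEYzWiQ3JF8kPyQkGyhC?=';

# Encode's MIME-Header codec unwraps the =?charset?B?...?= framing,
# base64-decodes the payload, and converts from iso-2022-jp into Perl's
# internal character representation in one step:
my $subject = decode('MIME-Header', $raw);

# $subject is now a character string you can print or match against.
```

The base64 payload happens to start with the ASCII bytes "FW: " before switching into the iso-2022-jp escape sequence for the Japanese part, which is why a forwarded-mail prefix survives the round trip intact.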
But you can't match /(\w+)/ against it. It's apparently one "word" without
spaces in it.
Um... I don't know Japanese. But I guess this string of spaghetti (to
me) is actually in a language where what a Unicode terminal shows as one
character is actually one 'word' according to the Perl definition of a
word...
In English, this would pick apart words in a way that is simple for me
and many on this list to understand.
I guess my question is: for CJK languages, should I expect a regex like
/\w+/ to pick up entire runs of text instead of the discrete words it
would find in Latin-based languages?
Sadly, I must admit that I'm operating way outside of my knowledge
domain on this one, but I'll try to give an answer.
Yes, be prepared for the fact that not all languages support the
concept of spaces between words. I don't know anything about Japanese,
but I do vaguely remember from high school that Chinese texts often have
no spaces between words, and the reader's knowledge of the language
allows him or her to infer the word separations.
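To see the behavior you described concretely, here's a small sketch (the Japanese string below is just an arbitrary example, not your actual subject line):

```perl
use strict;
use warnings;
use utf8;   # the string literal below is UTF-8 in the source file

# An arbitrary run of kanji and kana with no spaces between "words":
my $ja = "日本語のテキスト";

my @words = $ja =~ /(\w+)/g;

# Under Perl's Unicode semantics, ideographs and kana are all \w
# characters, so the entire run comes back as a single match:
# scalar(@words) == 1
```

So \w+ is doing exactly what it's documented to do; it's the "words are separated by non-word characters" assumption that doesn't hold for unsegmented CJK text.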
However, even without knowing Japanese, we might be able to help you
find acceptable solutions. What is your program supposed to do?
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/