At Wed, 11 Oct 2000 14:33:30 +1100, Craig Small <[EMAIL PROTECTED]> wrote:
> I cannot do the following ones until I get a charset:
> jp - iso-2022-jp

`ja' if you refer to the language code, not the country code.

Anyway, iso-2022-jp is not so easy. It uses US-ASCII characters and
escape sequences such as "ESC $ B". In iso-2022-jp, when the byte
stream reaches "ESC $ B" or "ESC $ @", it switches to JIS X 0208
characters, a 94x94 set in which every character takes 2 bytes. That
mode lasts until "ESC ( B" or "ESC ( J", which switches back to
US-ASCII (or JIS X 0201 Roman). So we can't name a fixed set of bytes
that make up words in the iso-2022-jp encoding.

For more information, see RFC1468:
  Japanese Character Encoding for Internet Messages
  http://www.rfc-editor.org/rfc/rfc1468.txt

> For the dual-byte folks, I don't think this will work. The upstream
> author is willing to work with you, but he's not sure how to do it.
> Actually it may work... if you put both bytes into the charset.
> Depends on what your whitespace looks like.

Whitespace in iso-2022-jp is 0x20, plus the sequence "0x21 0x21" ("!!")
when it appears between "0x1B 0x24 0x42" ("ESC $ B") or
"0x1B 0x24 0x40" ("ESC $ @") and "0x1B 0x28 0x42" ("ESC ( B") or
"0x1B 0x28 0x4A" ("ESC ( J").

However, the Japanese language doesn't use whitespace to separate
words. This is why we need a tool such as chasen, a Japanese
morphological analysis system.

Regards,
Fumitoshi UKAI
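To illustrate why a plain byte set is not enough, here is a minimal Python sketch of the mode switching described above. It is only an illustration under stated assumptions, not code from the program being discussed: the function name find_whitespace and the table ESCAPES are made up for this example, and only the four escape sequences mentioned in this mail are handled.

```python
# Escape sequences named in this mail (see RFC 1468).
# The value says whether the following bytes are 2-byte characters.
ESCAPES = {
    b"\x1b(B": False,  # ESC ( B : US-ASCII
    b"\x1b(J": False,  # ESC ( J : JIS X 0201 Roman
    b"\x1b$@": True,   # ESC $ @ : JIS X 0208 (2 bytes per character)
    b"\x1b$B": True,   # ESC $ B : JIS X 0208 (2 bytes per character)
}


def find_whitespace(data):
    """Return byte offsets of whitespace in an iso-2022-jp byte string.

    Hypothetical helper: 0x20 counts in single-byte mode, and the pair
    0x21 0x21 counts only in 2-byte mode, on a character boundary.
    """
    offsets = []
    double_byte = False  # iso-2022-jp text starts in US-ASCII
    i = 0
    while i < len(data):
        esc = data[i:i + 3]
        if esc in ESCAPES:
            # Escape sequence: switch mode, emit nothing.
            double_byte = ESCAPES[esc]
            i += 3
        elif double_byte:
            # Consume one 2-byte character at a time so pairs stay aligned.
            if data[i:i + 2] == b"\x21\x21":
                offsets.append(i)
            i += 2
        else:
            if data[i] == 0x20:
                offsets.append(i)
            i += 1
    return offsets


sample = b"a b" + b"\x1b$B" + b"\x21\x21\x30\x21" + b"\x1b(B" + b"!!"
print(find_whitespace(sample))
```

In the sample, the 0x20 at offset 1 and the 2-byte space at offset 6 are reported, while the trailing "!!" in US-ASCII mode is not. The same two bytes mean different things depending on the last escape sequence seen, so a matcher has to track the current mode.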

