RE: Language and DoubleByte Language

Kerry Thompson Thu, 13 May 2004 11:08:42 -0700

> The product has some character entry...file names, order stuff, etc.
> And there's also some character parsing like "Using % to wash your
> clothes is &." where they intend to have strings translated into
> multiple languages and hopefully still do some parse-replacing of
> characters.


The character entry will still need the native-language system, unless
you are restricting entry to few enough characters that they can choose
from an on-screen palette.

I don't see that type of parsing as a problem. I assume you will be
saving possible answers as separate entities, so you won't have to parse
the Chinese, Japanese, or Korean text.
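
To make that concrete, here is a minimal sketch of the "separate
entities" approach. It is written in Python rather than Lingo purely for
illustration, and everything in it is an assumption on my part: the
TEMPLATES and ANSWERS tables, the fill() helper, and the sample Japanese
strings are hypothetical, and it assumes the text is handled as Unicode
characters rather than raw double-byte data. The point is only that the
placeholders are swapped out without ever parsing the translated text.

```python
# Hypothetical data: each language stores its own template and its own
# answers, so nothing has to be extracted from the translated sentence.
TEMPLATES = {
    "en": "Using % to wash your clothes is &.",
    "ja": "%で服を洗うのは&です。",
}
ANSWERS = {
    "en": {"%": "bleach", "&": "risky"},
    "ja": {"%": "漂白剤", "&": "危険"},
}


def fill(lang):
    """Replace each placeholder character with that language's answer."""
    answers = ANSWERS[lang]
    # Single pass over the characters: anything that is not a placeholder
    # is copied through unchanged, so inserted answers are never rescanned.
    return "".join(answers.get(ch, ch) for ch in TEMPLATES[lang])


print(fill("en"))
print(fill("ja"))
```

Because the translator decides where the placeholders go in each
template, word order can differ between languages without any change to
the code.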

I don't know if you saw my original response a week ago, so forgive me
if I repeat myself.

I'm not so familiar with Korean, but Chinese and Japanese have no
natural word breaks like most Western languages do, and that makes
parsing a real challenge. There are several issues:
- They may or may not use spaces. If they do, spaces will not
necessarily indicate a word break.
- There is no set sort order for Chinese/Kanji characters that
corresponds with our alphabetic ordering. There are a few commonly used
ordering schemes, but nothing universally agreed upon.
- Words can break across lines without hyphens.
- There is no way, other than context, to know if a character is a
self-standing word, or part of a multi-character word. For example, take
"you" in Chinese (pronounced "yo", like Sylvester Stallone would). It
can be used by itself, it can be the first character in a word (you-yi,
friendship), or it can be the last character in a word (peng-you,
friend). Likewise, the "yi" in "you-yi" and the "peng" in "peng-you" can
appear in different contexts.
- Japanese freely mixes kanji, katakana, hiragana, and romaji.
- Chinese is usually read from left to right, but can be right to left,
or top to bottom.

I could go on, but I think you get the idea. Parsing CCJK is several
orders of magnitude more difficult than parsing alphabetic languages. If
you're going to have to parse unknown input, you will need to set aside
a huge chunk of time and money to develop your parsing routines. And
they will be different for Chinese, Japanese, and Korean.

Cordially,

Kerry Thompson

