At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
>Hello all,
>
>I'm writing a C program called Blacklist. Its purpose is to accept
>a string (Unicode) and extract words from it, then hash the found words
>according to a hashing algorithm and see if each word is in the blacklist
>hashtable.
>
>This is all very straightforward, but the problem is the extracting of
>words from this string.
>How do I determine what a word is in Japanese or Korean or whatever other
>language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and 
they are not required in Korean.
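
To illustrate, here is a minimal C sketch of the naive "split on spaces"
approach. It is only an illustration, not a recommended tokenizer: the
function name count_space_tokens is made up for this example, and it
assumes the text is held in a wchar_t string and treats only space, tab
and newline as separators.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Count maximal runs of non-space characters, i.e. what a naive
   "split on spaces" tokenizer would report as words.
   Assumption: only ' ', '\t' and '\n' are treated as separators. */
size_t count_space_tokens(const wchar_t *text)
{
    assert(text != NULL);

    size_t tokens = 0;
    int in_token = 0; /* nonzero while we are inside a run of non-spaces */

    for (; *text != L'\0'; text++) {
        int is_space = (*text == L' ' || *text == L'\t' || *text == L'\n');
        if (is_space) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1; /* first character of a new run */
            tokens++;
        }
    }
    return tokens;
}
```

For L"the cat sat" this reports three tokens, but for the Japanese
sentence "Tokyo ni iku" (I go to Tokyo) written in its usual form,
L"\u6771\u4EAC\u306B\u884C\u304F", it reports a single token, although
the sentence contains three words.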

In Devanagari and related scripts a consonant at the end of a word 
can join with a vowel at the beginning of the next word in a single 
symbol, so you can't just divide the string into segments. There are 
other complications in other writing systems.

The problem is not trivial in Latin-alphabet writing, either.
Hyphenated expressions can be quasi-unified words in which one or more
components are not separate words, or ad hoc, even one-time-only,
phrases. What counts as a word in a language also changes over time.
"Cannot" is currently one word, but used to be two. "An adder" used
to be "a nadder".

>I think somebody must have had this problem and solved it, or maybe my
>approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it 
entirely; and yes, this approach is wrong. Breaking a string into 
words may require a thorough understanding of the vocabulary and 
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange 
isseoyo (Father is in the bag)?
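
To make the ambiguity concrete, here is a minimal C sketch of a
dictionary-based segmenter that simply counts how many ways a string can
be split into known words. Everything in it is an assumption made for
illustration: the five-entry romanized lexicon, the names LEXICON and
count_segmentations, and the use of plain ASCII instead of Hangul. A real
system would need a full dictionary and some way to choose among the
candidates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy lexicon of romanized Korean forms. */
static const char *const LEXICON[] = {
    "abeoji",   /* father */
    "abeojiga", /* father + subject particle */
    "bange",    /* in the room */
    "gabange",  /* in the bag */
    "isseoyo",  /* is (there) */
};
static const size_t LEXICON_SIZE = sizeof LEXICON / sizeof LEXICON[0];

/* Count the ways `text` can be split entirely into lexicon entries.
   Plain recursion, exponential in the worst case; fine for a demo. */
size_t count_segmentations(const char *text)
{
    assert(text != NULL);

    if (*text == '\0')
        return 1; /* everything consumed: one complete segmentation */

    size_t total = 0;
    for (size_t i = 0; i < LEXICON_SIZE; i++) {
        size_t len = strlen(LEXICON[i]);
        /* If this entry is a prefix of the text, segment the remainder. */
        if (strncmp(text, LEXICON[i], len) == 0)
            total += count_segmentations(text + len);
    }
    return total;
}
```

With this lexicon, count_segmentations("abeojigabangeisseoyo") finds two
complete segmentations, one for each reading, and the dictionary alone
gives no reason to prefer either.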

>I hope somebody can give me some good pointers, directions or suggestions.
>
>Thanks for your time,
>
>
>Brahim Mouhdi
>
>{42.}

-- 

Edward Cherlin, Spamfighter <http://www.cauce.org>
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit
