As Edward said, getting words from a string is nontrivial. You get similar
issues in Thai. Thai does not put any space between words, and the script
is Indic-based (phonetic). You have to keep looking candidate words up in
a spelling dictionary, and even then the result can't be correct in all
cases. E.g.

Written without spaces, "Sunday" could be read as the two words "sun" and
"day", and "therapist" as "the" and "rapist", when the user meant a single
word. In Sanskrit, you can also create new words by joining existing ones
through "sandhi", or conjunction.
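
To make the look-up ambiguity concrete, here is a minimal Python sketch that
lists every way an unspaced string can be split into dictionary words. The
function name segmentations and the six-word LEXICON are made up for
illustration; a real speller would hold far more words and would produce
far more candidate splits.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words that are all in lexicon."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and split the remainder in every possible way.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon (an assumption for this example, not a real word list).
LEXICON = {"sun", "day", "sunday", "the", "rapist", "therapist"}

for s in ("sunday", "therapist"):
    print(s, segmentations(s, LEXICON))
```

Each string comes back with two valid splits, and nothing in the dictionary
alone says which one the user meant.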


Makarand


-----Original Message-----
From: Edward Cherlin [mailto:[EMAIL PROTECTED]] 
Sent: Sunday, 11 February, 2001 05:34
To: Unicode List
Subject: Re: extracting words


At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
>Hello all,
>
>I'm writing a C program called Blacklist. Its purpose is to 
>accept a string (unicode) and extract words from it, then hash the 
>found words according to a hashing algorithm and see if the word is in 
>the blacklist hashtable.
>
>This is all very straightforward, but the problem is the extracting of 
>words from this string. How do I determine what a word is in Japanese or 
>Korean or whatever other language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and 
they are not required in Korean.

In Devanagari and related scripts a consonant at the end of a word 
can join with a vowel at the beginning of the next word in a single 
symbol, so you can't just divide the string into segments. There are 
other complications in other writing systems.

The problem is not trivial in Latin alphabet writing, either. 
Hyphenated expressions can be quasi-unified words where one or more 
components is not a separate word, or ad-hoc, even one-time-only 
phrases. The definition of words in a language is also changing. 
"Cannot" is currently one word, but used to be two. "An adder" used 
to be "a nadder".

>I think somebody must have had this problem and solved it, or maybe my 
>approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it 
entirely; and yes, this approach is wrong. Breaking a string into 
words may require a thorough understanding of the vocabulary and 
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange 
isseoyo (Father is in the bag)?
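
A common shortcut is greedy longest-match against a word list. The Python
sketch below applies it to the example above, using a toy lexicon built from
the romanized forms in this message (the name greedy_segment and the lexicon
are assumptions for illustration only). It returns exactly one reading and
never considers the other, so it cannot tell which one the writer meant.

```python
def greedy_segment(text, lexicon):
    """Split text by always taking the longest lexicon word at each position."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(text), pos, -1):
            if text[pos:end] in lexicon:
                words.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no lexicon word starts at position {pos}")
    return words


# Toy lexicon (assumed): the romanized pieces of both candidate readings.
LEXICON = {"abeoji", "abeojiga", "bange", "gabange", "isseoyo"}

print(greedy_segment("abeojigabangeisseoyo", LEXICON))
```

With this lexicon the greedy pass yields abeojiga / bange / isseoyo. Choosing
between the two readings needs knowledge of the language and the context,
not just a word list.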

>I hope somebody can give me some good pointers, directions or 
>suggestions.
>
>Thanks for your time,
>
>
>Brahim Mouhdi
>
>{42.}

-- 

Edward Cherlin, Spamfighter <http://www.cauce.org>
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit
