Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-13 Thread Mark Davis
[EMAIL PROTECTED] Sent: Monday, February 12, 2001 20:30 Subject: Re: Korean linebreking and UTR14(was Re: extracting words) On Mon, 12 Feb 2001, Mark Davis wrote: Thank you for your answer. Asmus Freytag is the one to talk to; he can look into this. Do you think I should contact him

RE: extracting words

2001-02-13 Thread Christopher John Fynn
Mark Davis wrote: BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese and Thai are the

RE: extracting words

2001-02-13 Thread jarkko . hietaniemi
BTW In traditional Tibetan orthography, a space is *not* a line break opportunity. What's the role of a space in there, then? - Chris -- Chris Fynn DDC Dzongkha Computing Project Thimphu, Bhutan.

RE: extracting words

2001-02-12 Thread jarkko . hietaniemi
- line break (wrapping lines on the screen) - word break (for selection) - word/root extraction (for search) I recognize that the second and third case are really difficult to handle. Root extraction is decidedly non-trivial and a highly language-specific problem, even more so than

RE: extracting words

2001-02-12 Thread Mark Leisher
- line break (wrapping lines on the screen) - word break (for selection) - word/root extraction (for search) I recognize that the second and third case are really difficult to handle. Jarkko: Root extraction is decidedly non-trivial and a highly

[OT?] Re: extracting words

2001-02-12 Thread DougEwell2
In a message dated 2001-02-12 8:54:10 Pacific Standard Time, [EMAIL PROTECTED] writes: Also, I would add - "syllablization" (is that a word?) as a third problem (for breaking words more nicely into lines), it would rank in difficulty somewhere between word breaking and root

Re: extracting words

2001-02-12 Thread Michael (michka) Kaplan
From: "Kenneth Whistler" [EMAIL PROTECTED] the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I know that the term "syllable" is not technically correct here, so please don't nitpick me to death on this one. ;-) Ironically enough, there are a number of native speakers

Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Jungshik Shin
On Sun, 11 Feb 2001, Mark Davis wrote: Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I recommended in my last message. The Unicode standard is online, as is the TR. Both can be found by going to www.unicode.org, and selecting the right topic. The TR in

Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Mark Davis
Asmus Freytag is the one to talk to; he can look into this. Mark - Original Message - From: "Jungshik Shin" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 13:33 Subject: Korean linebreking and UTR14(was Re: extracting words)

Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Jungshik Shin
On Mon, 12 Feb 2001, Mark Davis wrote: Thank you for your answer. Asmus Freytag is the one to talk to; he can look into this. Do you think I should contact him directly off-line? I thought he's on this list now as well as back in March 2000 when I wrote about TUS 3.0 p. 124. On Mon, 12

Re: extracting words

2001-02-12 Thread Jungshik Shin
On Sun, 11 Feb 2001, Mark Davis wrote: BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese

Re: extracting words

2001-02-11 Thread Jungshik Shin
On Sat, 10 Feb 2001, Edward Cherlin wrote: At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote: I'm writing a C program that is called Blacklist. Its purpose is to accept a string (unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the

Re: extracting words

2001-02-11 Thread Jonathan Lewis
in the first grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese where spaces are rarely used to separate words. From Japanese, how about: kokodehakimonowonuidekudasai This could be koko de hakimono wo nuide kudasai (take
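The ambiguity in the romanized example above can be made concrete with a small sketch. It enumerates every way to split an unspaced string into entries of a word list. The function name `segmentations` and the toy lexicon are hypothetical, chosen only for illustration. In particular, the entries "ha" and "kimono" are an assumption about the alternative reading, since the message is cut off before it is stated. A real Japanese segmenter would need a full dictionary and some way of ranking the candidates.

```python
def segmentations(text, lexicon):
    """Return every split of `text` into words that all appear in `lexicon`.

    Each result is a list of words. Plain recursion is enough for a short
    demo string; it is not meant to be efficient on long input.
    """
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon (hypothetical): just enough entries to expose the ambiguity.
LEXICON = {"koko", "de", "ha", "kimono", "hakimono", "wo", "nuide", "kudasai"}

if __name__ == "__main__":
    for words in segmentations("kokodehakimonowonuidekudasai", LEXICON):
        print(" ".join(words))
```

With this lexicon the string has two valid segmentations, one containing "hakimono" and one containing "ha kimono", so a word list alone cannot decide between them.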

Re: extracting words

2001-02-11 Thread Mark Davis
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the Linebreak TR. For linebreak the only tricky language is Thai, since it requires a dictionary lookup (much like hyphenation in English). Java (and ICU) supply linebreak mechanisms as a part of the standard API. They also

Re: extracting words

2001-02-11 Thread Mark Davis
pic. - Original Message - From: "Mike Lischke" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Sunday, February 11, 2001 11:32 Subject: re: extracting words - line break (wrapping lines on the screen) - word break (for selection) - word/root extraction (

Re: extracting words

2001-02-10 Thread Edward Cherlin
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote: Hello all, I'm writing a C program that is called Blacklist. Its purpose is to accept a string (unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the word is in blacklist hashtable. This is
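For reference, the pipeline described in this message (extract words, hash them, look them up) might be sketched as follows. This is not the poster's program: it is written in Python rather than C, it uses Python's built-in `set` as the hash table, and the names `extract_words` and `blacklisted_words` are hypothetical. It also assumes that a run of letters and digits is a word, which only holds for text written with spaces between words. That assumption is exactly what the rest of the thread calls into question.

```python
import re

# Assumption: a "word" is a maximal run of letters, digits or underscore.
# This is only reasonable for scripts that separate words with spaces.
_WORD = re.compile(r"\w+")


def extract_words(text):
    """Return the candidate words found in `text`, in order of appearance."""
    return _WORD.findall(text)


def blacklisted_words(text, blacklist):
    """Return the words of `text` that occur in `blacklist`, ignoring case.

    The lookup uses a set, which is a hash table, so each membership test
    hashes the case-folded word.
    """
    folded = {word.casefold() for word in blacklist}
    return [w for w in extract_words(text) if w.casefold() in folded]
```

Case folding is one design choice among several; whether two strings should count as the same word is itself language-dependent.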

RE: extracting words

2001-02-10 Thread Makarand Gadre
icode List Subject: Re: extracting words At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote: Hello all, I'm writing a C program that is called Blacklist. Its purpose is to accept a string (unicode) and extract words from it, then hash the found words according to a hashing algorithm and see i

RE: extracting words

2001-01-29 Thread Christopher John Fynn
You might have to apply different rules dependent on the script. In Indic scripts there are often no explicit word boundary markers and you may have to look for grammatical particles. In Tibetan, a string of letters and vowels between two tsheg [0F0B / 0F0C] characters (or other "punctuation")
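As a rough illustration of the Tibetan case, the sketch below splits a string at the two tsheg characters named above. The function name `tsheg_units` is hypothetical, and the extra delimiters (space, U+0F0D, U+0F0E) are only a guess at what "other punctuation" might cover, not a complete list. The resulting units are syllable-like pieces, not words; grouping them into words would still need language knowledge.

```python
import re

# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
# U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
# U+0F0D and U+0F0E (shad, nyis shad) and the space are extra delimiters
# included here as an assumption; the list is illustrative, not exhaustive.
_DELIMITERS = re.compile("[\u0f0b\u0f0c\u0f0d\u0f0e ]+")


def tsheg_units(text):
    """Split `text` at tsheg (and the assumed extra delimiters).

    Empty pieces, such as the one after a trailing tsheg, are dropped.
    """
    return [unit for unit in _DELIMITERS.split(text) if unit]
```

For example, a string of two units each followed by U+0F0B yields a list of two pieces with the tsheg characters removed.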

Re: extracting words

2001-01-29 Thread Lukas Pietsch
Christopher Fynn wrote: BTW without determining the language as well as the script, how do you propose to determine if a particular string actually matches a word in your "blacklist" (in terms of meaning) or not? The same string of characters might mean completely different things in two

Re: extracting words

2001-01-29 Thread John Cowan
Lukas Pietsch wrote: This is assuming that what we want is not just a matching of *orthographical* words (character strings), but of *lexicographical* words (aka lexemes). But it is impossible in fully cross-linguistic situations in general. There is simply nothing to do about the fact