Sent: Monday, February 12, 2001 20:30
Subject: Re: Korean linebreaking and UTR14 (was Re: extracting words)
On Mon, 12 Feb 2001, Mark Davis wrote:
Thank you for your answer.
Asmus Freytag is the one to talk to; he can look into this.
Do you think I should contact him
Mark Davis wrote:
BTW, someone on this thread made this topic out to be even more complex than
it is: that Devanagari and Korean are written without spaces. While that may
have been the case historically, I believe that the modern text does use
spaces. Chinese, Japanese and Thai are the
BTW In traditional Tibetan orthography, a space is *not* a line break
opportunity.
What's the role of a space in there, then?
- Chris
--
Chris Fynn
DDC Dzongkha Computing Project
Thimphu, Bhutan.
- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)
I recognize that the second and third case are really
difficult to handle.
Root extraction is decidedly non-trivial and a highly language-specific
problem, even more so than
- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)
I recognize that the second and third case are really difficult to
handle.
Jarkko Root extraction is decidedly non-trivial and a highly
Jarkko
In a message dated 2001-02-12 8:54:10 Pacific Standard Time,
[EMAIL PROTECTED] writes:
Also, I would add
- "syllablization" (is that a word?) as a third problem (for breaking words
more nicely into lines), it would rank in difficulty somewhere between word
breaking and root
From: "Kenneth Whistler" [EMAIL PROTECTED]
the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I
know that the term "syllable" is not technically correct here, so please don't
nitpick me to death on this one. ;-)
Ironically enough, there are a number of native speakers
On Sun, 11 Feb 2001, Mark Davis wrote:
MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in
Asmus Freytag is the one to talk to; he can look into this.
Mark
- Original Message -
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 13:33
Subject: Korean linebreaking and UTR14 (was Re: extracting words)
On Mon, 12 Feb 2001, Mark Davis wrote:
Thank you for your answer.
Asmus Freytag is the one to talk to; he can look into this.
Do you think I should contact him directly off-line? I thought he's on
this list now as well as back in March 2000 when I wrote about TUS 3.0
p. 124.
On Mon, 12
On Sun, 11 Feb 2001, Mark Davis wrote:
BTW, someone on this thread made this topic out to be even more complex than
it is: that Devanagari and Korean are written without spaces. While that may
have been the case historically, I believe that the modern text does use
spaces. Chinese, Japanese
On Sat, 10 Feb 2001, Edward Cherlin wrote:
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
I'm writing a C program called Blacklist. Its purpose is to accept
a string (unicode) and extract words from it, then hash the found words
according to a hashing algorithm and see if the
in the first grade Korean class). It would have been more appropriate
if you had come up with an example from Japanese or Chinese where spaces
are rarely used to separate words.
From Japanese, how about:
kokodehakimonowonuidekudasai
This could be
koko de hakimono wo nuide kudasai (take
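The ambiguity in an unspaced string like this can be made concrete with a small sketch. The function below enumerates every way of splitting the romanized sentence into entries of a lexicon. The lexicon here is a toy one invented for illustration (it is an assumption, not a real dictionary), and a real segmenter would need far more than exhaustive lookup.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words drawn from lexicon."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and recurse on the remainder of the string.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, chosen only to show that more than one split exists.
LEXICON = {"koko", "de", "ha", "hakimono", "kimono", "wo", "nuide", "kudasai"}

if __name__ == "__main__":
    for words in segmentations("kokodehakimonowonuidekudasai", LEXICON):
        print(" ".join(words))
```

With this lexicon the string has two valid splits, one containing "hakimono" and one containing "ha kimono", so dictionary lookup alone cannot decide which was meant.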
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the
Linebreak TR. For linebreak the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply linebreak mechanisms as a part of the standard API. They also
pic.
- Original Message -
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words
- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
Hello all,
I'm writing a C program called Blacklist. Its purpose is to accept
a string (unicode) and extract words from it, then hash the found words
according to a hashing algorithm and see if the word is in the blacklist
hashtable.
This is
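For scripts that do separate words with spaces or punctuation, the extract-then-look-up idea described above might be sketched as follows. This is written in Python rather than C for brevity, the function names are hypothetical, and a Python set stands in for the hash table. It assumes that runs of letters and digits are words, which does not hold for Chinese, Japanese or Thai.

```python
import re
import unicodedata


def normalize(word):
    """Fold case and compose characters so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", word).casefold()


def blacklisted_words(text, blacklist):
    """Return the words of text that appear in blacklist, in order of occurrence."""
    blocked = {normalize(w) for w in blacklist}  # hash-based membership test
    # Compose first so base letter + combining mark becomes one letter,
    # then take maximal runs of Unicode letters, digits and underscores.
    composed = unicodedata.normalize("NFC", text)
    return [w for w in re.findall(r"\w+", composed) if normalize(w) in blocked]


if __name__ == "__main__":
    print(blacklisted_words("Spam and \u00c9clair, spam!", ["spam", "e\u0301clair"]))
```

Note that this only matches character strings. It says nothing about whether the same string means the same thing in two languages.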
icode List
Subject: Re: extracting words
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
Hello all,
I'm writing a C program called Blacklist. Its purpose is to
accept a string (unicode) and extract words from it, then hash the
found words according to a hashing algorithm and see i
You might have to apply different rules depending on the script. In Indic scripts
there are often no explicit word boundary markers and you may have to look for
grammatical particles. In Tibetan, a string of letters and vowels between two tsheg
[0F0B / 0F0C] characters (or other "punctuation")
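As a rough illustration of the Tibetan case, the sketch below splits a string on the two tsheg characters named above. Treating the shad (U+0F0D) and the ordinary space as further delimiters is an assumption made here for the example, and the function name is hypothetical. The units returned are tsheg-delimited strings, not necessarily words.

```python
import re

# U+0F0B tsheg and U+0F0C non-breaking tsheg come from the message above;
# U+0F0D (shad) and the space are assumed extra delimiters for this sketch.
_DELIMITERS = re.compile("[\u0f0b\u0f0c\u0f0d ]+")


def tibetan_syllables(text):
    """Split text into the runs of characters found between delimiters."""
    return [unit for unit in _DELIMITERS.split(text) if unit]


if __name__ == "__main__":
    for unit in tibetan_syllables("བཀྲ་ཤིས་བདེ་ལེགས།"):
        print(unit)
```

Grouping these units into actual words would still need language knowledge, since a Tibetan word may span several of them.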
Christopher Fynn wrote:
BTW without determining the language as well as the script, how do you
propose to determine if a particular string actually matches a word in
your "blacklist" (in terms of meaning) or not? The same string of
characters might mean completely different things in two
Lukas Pietsch wrote:
This is assuming that what we want is not just a matching of
*orthographical* words (character strings), but of *lexicographical* words
(aka lexemes).
But it is impossible in fully cross-linguistic situations in general.
There is simply nothing to do about the fact