Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-13 Thread Mark Davis

If I wanted to get anyone's attention, I would send them a direct message.
Many people on the list, myself included, get swamped at times and don't
necessarily look at every message.

Mark

- Original Message -
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 20:30
Subject: Re: Korean linebreking and UTR14(was Re: extracting words)




 On Mon, 12 Feb 2001, Mark Davis wrote:

 Thank you for your answer.

  Asmus Freytag is the one to talk to; he can look into this.

 Do you think I should contact him directly off-line? I thought he's on
 this list now as well as  back in March 2000 when I wrote about TUS 3.0
 p. 124.

   On Mon, 12  Feb 2001, "Jungshik Shin" [EMAIL PROTECTED] wrote:
   On Sun, 11 Feb 2001, Mark Davis wrote:
  
   MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
   MD recommended in my last message. The Unicode standard is online, as is

   As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
   that leads to the following in the TR on line breaking (and what's written
   about it in Chap 5 of TUS 3.0) came from.

   UTR14   Korean may alternately use a space-based (style 1) instead of the
   UTR14   style 2 context analysis.

 BTW, this clearly shows that what Rick McGowan wrote about 'either ... or'
 in response to what I wrote about Korean line breaking rule (TUS 3.0
 p. 124) in March 2000 is not right, as I argued then.  I'm sure he's
 right about 'either ... or ' in English grammar but the intention of the
 author is on my side if the author of UTR 14 is the same as that of the
 part  in question in TUS 3.0. I'm enclosing at the end of this message
 a part of my message in response to him.


   I'm very alarmed to find this 'misinformation' crept into the TUS and
   UTR14 (now UAX #14). It would be nice if somebody in charge could get
   this straightened out.

 This wasn't fixed in Unicode 3.1, either. What would be the best way
 to get it addressed before the next revision comes out? I'm afraid just
 raising it on this list wouldn't be sufficient (of course, I should
 have followed up more vigorously last year).

 Regards,

 Jungshik Shin


 Enc.

 1. Two messages of mine
the first one : March 1, 2000
the second one: March 2, 2000

 From: Jungshik Shin [EMAIL PROTECTED]
 Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
 Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

 On Sun, 13 Feb 2000, Kenneth Whistler wrote:

  Lest anyone feel unduly constrained, let me note that now that
  the editorial committee has closed the book, so to speak, on Unicode 3.0,
  all of you who are about to open the book for the first time should
  feel free to unleash your commentary on the text.

I've just received my copy of Unicode 3.0 book, here goes
 my first commentary.

On page 124 (section 5.15, Locating Text Element Boundaries),
 the third paragraph has the following around the end:

 U3.0 In particular, word, line, and sentence boundaries will need to
 U3.0 be customized according to locale and user preference. In Korean,
 U3.0 for example, lines may be broken either at spaces(as in Latin text) or
 U3.0 on ideographic boundaries (as in Chinese).

   First of all, it's a great mystery to me how on earth this
 strange notion of Korean having *two* different line breaking rules(as
 opposed to one)  crept into the expertise of non-Korean experts on Korean
 and finally made it into Unicode 3.0 book and Unicode TR on line breaking.

   None of tens of Korean books on my bookshelves
 I've just gone through breaks lines *exclusively* at spaces. All of them
 break lines freely at *syllables*. Only places where lines are broken
 *exclusively* at spaces(for Korean text)  I can think of are completely
 *broken*(as far as Korean line breaking is concerned) web browsers like
 Netscape and MS IE and possibly earlier implementations of Korean LaTeX.
 One may add  to the list Korean text formatted by non-localized version
 of 'fmt' (in Unix) as another example. To work around the problem caused
 by these broken web browsers, some Korean web authors apply a simple
 filter to insert <wbr> between every pair of Korean syllables in their
 HTML files. To see what I mean, you may want to take a look at
 http://photon.hgs.yale.edu/~jungshik/lb.html and
 http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg

   Let me emphasize that lines can be broken at any syllable boundary
 in Korean text (except for some obvious exceptions as applied in English
 text: i.e. punctuation marks like '!', '?' cannot begin a line).

   Secondly, even in Latin scripts(well, at least in English) lines can
 be broken not only at spaces but also at syllables(syllabic boundaries)
 with a hyphen.  The only difference between Korean line breaking and English
 line breaking is that Korean doesn't need a hyphen when lines are broken at
 syllables because in Korean syllables form another visual unit a

RE: extracting words

2001-02-13 Thread Christopher John Fynn

 Mark Davis wrote:

 BTW, someone on this thread made this topic out to be even more complex than
 it is: that Devanagari and Korean are written without spaces. While that may
 have been the case historically, I believe that the modern text does use
 spaces. Chinese, Japanese and Thai are the main languages written without
 spaces.

Several Indic languages/scripts do not use spaces (or other marker characters) between 
words or syllables. I don't think you can even rely on spaces between words for all 
the different Indic languages that use only the Devanagari script.

Tibetan script has a "syllable" (or morpheme) separator [U+0F0B] which provides a line 
break opportunity - but in modern Dzongkha (Bhutanese)  this character is dropped in 
many places where a reader can determine the boundary by grammatical rules. BTW In 
traditional Tibetan orthography, a space is *not* a line break opportunity.

- Chris  

--
Chris Fynn 
DDC Dzongkha Computing Project
Thimphu, Bhutan.



RE: extracting words

2001-02-13 Thread jarkko . hietaniemi

 BTW In traditional Tibetan orthography, a space is *not* a line break
opportunity.

What's the role of a space in there, then?
 
 - Chris  
 
 --
 Chris Fynn 
 DDC Dzongkha Computing Project
 Thimphu, Bhutan.



RE: extracting words

2001-02-12 Thread jarkko . hietaniemi

  - line break (wrapping lines on the screen)
  - word break (for selection)
  - word/root extraction (for search)
 
 I recognize that the second and third case are really 
 difficult to handle.

Root extraction is decidedly non-trivial and a highly language-specific
problem, even more so than word breaking; it's a messy linguistic problem
instead of a clean algorithmic problem.
To start with, the choice of the term "extraction" shows that one has not
understood the problem in all its g(l)ory: a more appropriate term would be
"finding", or maybe, "reducing" the root.

Also, I would add

- "syllablization" (is that a word?) as a third problem (for breaking words
more nicely into lines), it would rank in difficulty somewhere between word
breaking and root extraction.

 But for word wrapping I assume line 
 breaking is sufficient. But when I don't have spaces to use 
 for wrapping and/or don't know whether the actual text part 
 uses spaces at all (what about exotic languages like Ogham or 
 Anglo-saxon?) then how can I go to implement word wrapping? 
 Simply do it character by character?
 



RE: extracting words

2001-02-12 Thread Mark Leisher


  - line break (wrapping lines on the screen)  - word break (for
 selection)  - word/root extraction (for search)
 
 I recognize that the second and third case are really difficult to
 handle.

Jarkko Root extraction is decidedly non-trivial and a highly
Jarkko language-specific problem, even more so than word breaking; it's a
Jarkko messy linguistic problem instead of a clean algorithmic problem.
Jarkko To start with, the choice of the term "extraction" shows that one
Jarkko has not understood the problem in all its g(l)ory: a more
Jarkko appropriate term would be "finding", or maybe, "reducing" the
Jarkko root.

The words we use in computational linguistics are "stemming" and less
frequently "lemmatization."  This is often the step in morphological analysis
that precedes determining the part-of-speech.  Jarkko is right that it is a
messy problem for many languages.
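
To make the messiness concrete, here is a minimal Python sketch. It is an
illustration only: the suffix list and the exception table are made up for
this example and are not taken from any real stemmer. Purely algorithmic
suffix stripping copes with regular forms, but irregular forms such as the
"went"/"gone" -> "go" case mentioned elsewhere in this thread need a
language-specific lookup.

```python
# Toy illustration only: SUFFIXES and IRREGULAR are hypothetical, tiny tables.
SUFFIXES = ("ing", "ed", "es", "s")
IRREGULAR = {"went": "go", "gone": "go"}


def naive_stem(word):
    """Strip the first matching suffix; purely algorithmic, no dictionary."""
    for suffix in SUFFIXES:
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemma(word):
    """Look up irregular forms first, then fall back to suffix stripping."""
    return IRREGULAR.get(word, naive_stem(word))


if __name__ == "__main__":
    for w in ("walking", "walked", "went", "gone"):
        print(w, naive_stem(w), lemma(w))
```

The exception table is the part that does not scale: it has to be built per
language, which is why this is a linguistic problem more than an algorithmic
one.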

Jarkko - "syllablization" (is that a word?) as a third problem (for
Jarkko breaking words more nicely into lines), it would rank in
Jarkko difficulty somewhere between word breaking and root extraction.

I believe "syllabization" or perhaps "syllabification" might be the term.

 But for word wrapping I assume line breaking is sufficient. But when I
 don't have spaces to use for wrapping and/or don't know whether the
 actual text part uses spaces at all (what about exotic languages like
 Ogham or Anglo-saxon?) then how can I go to implement word wrapping?
 Simply do it character by character?
 
Spaces and other punctuation come in handy for line breaking.  Segmentation is
used with scripts that don't use this sort of intra-sentence term separation
(i.e. Chinese, Japanese, Thai).  There are whole conferences devoted to
segmentation approaches.  Another messy area of computational linguistics :-)
If segmentation is not available, then lines are often wrapped between
characters.
-
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM  88003

"But there is no doubt but money is to the fore now.  It is the romance,
the poetry of our age.  It's the thing that chiefly strikes our imagination."
    -- The Rise of Silas Lapham, W. D. Howells



[OT?] Re: extracting words

2001-02-12 Thread DougEwell2

In a message dated 2001-02-12 8:54:10 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 Also, I would add
  
  - "syllablization" (is that a word?) as a third problem (for breaking words
  more nicely into lines), it would rank in difficulty somewhere between word
  breaking and root extraction.

I think the canonical word is "syllabification," but from a word-inventing 
perspective, I agree with Jarkko's first instinct.  The suffix "-ize" seems 
more appropriate to the process being discussed than "-fy".

-Doug Ewell
 Fullerton, California



Re: extracting words

2001-02-12 Thread Michael \(michka\) Kaplan

From: "Kenneth Whistler" [EMAIL PROTECTED]

 the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I
 know that the term "syllable" is not technically correct here, so please don't
 nitpick me to death on this one. ;-)

Ironically enough, there are a number of native speakers who struggle with
the fact that "syllable" is apparently the best available word for them, if
all of the usual connotations could be dispensed with.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/





Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Jungshik Shin




On Sun, 11 Feb 2001, Mark Davis wrote:

MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in particular discusses the recommended approach to line break
MD in great detail.

As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
that leads to the following in the TR on line breaking (and what's written
about it in Chap 5 of TUS 3.0) came from.

UTR14   Korean may alternately use a space-based (style 1) instead of the
UTR14   style 2 context analysis.

UTR14 1.  Korean uses either implicit breaking around
UTR14 Hangul and ideographs or uses spaces. Reference [1] shows
UTR14 how this can be elegantly handled by the second or third
UTR14 method. Only the intersection of ID/ID, AL/ID and ID/AL
UTR14 are affected. For alphabetic style line breaking, breaks
UTR14 for these four cases require space, for ideographic style
UTR14 line breaking, these four cases don't require spaces.

where style 1 and style 2 are defined as

UTR14 1. Western (spaces and hyphens are used to determine breaks)
UTR14 2. East Asian (lines can break anywhere, unless prohibited)


Let me make it clear that virtually NO books published in Korean use the
space-based (style 1) line breaking rule. The style 2 line breaking rule
is *exclusively* used for modern Korean text, no matter what some broken
word processors for Korean offer as an alternative to style 2 and what
some web browsers (e.g. Netscape 4.x; Mozilla fixed this problem) do.
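
The practical difference between the two styles can be sketched in a few
lines of Python. This is a simplified illustration, not the UAX #14
algorithm: the Hangul range check (precomposed syllables U+AC00..U+D7A3
only) and the small set of characters that may not begin a line are
assumptions made for the example.

```python
# Sketch only: the Hangul range check and the NO_LINE_START set are
# simplifying assumptions, not the UAX #14 algorithm.
NO_LINE_START = set("!?.,)")


def is_hangul_syllable(ch):
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def break_opportunities(text, style):
    """Return the indices i such that a line may break before text[i]."""
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " " or cur in NO_LINE_START:
            continue  # never start a line with a space or such punctuation
        if prev == " ":
            breaks.append(i)  # both styles allow a break after a space
        elif style == 2 and is_hangul_syllable(prev) and is_hangul_syllable(cur):
            breaks.append(i)  # style 2: break between any two syllables
    return breaks


if __name__ == "__main__":
    sample = "아버지가 방에 있어요!"
    print(break_opportunities(sample, style=1))
    print(break_opportunities(sample, style=2))
```

With style 1 the sample sentence can only break after its two spaces; with
style 2 it can break between any two syllables, but never in front of the
final "!".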

I'm very alarmed to find this 'misinformation' crept into the TUS and
UTR14 (now UAX #14). It would be nice if somebody in charge could get
this straightened out.

Regards,

Jungshik Shin




Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Mark Davis

Asmus Freytag is the one to talk to; he can look into this.

Mark

- Original Message -
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 13:33
Subject: Korean linebreking and UTR14(was Re: extracting words)





 On Sun, 11 Feb 2001, Mark Davis wrote:

 MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
 MD recommended in my last message. The Unicode standard is online, as is the
 MD TR. Both can be found by going to www.unicode.org, and selecting the right
 MD topic. The TR in particular discusses the recommended approach to line break
 MD in great detail.

 As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
 that leads to the following in the TR on line breaking (and what's written
 about it in Chap 5 of TUS 3.0) came from.

 UTR14   Korean may alternately use a space-based (style 1) instead of the
 UTR14   style 2 context analysis.

 UTR14 1.  Korean uses either implicit breaking around
 UTR14 Hangul and ideographs or uses spaces. Reference [1] shows
 UTR14 how this can be elegantly handled by the second or third
 UTR14 method. Only the intersection of ID/ID, AL/ID and ID/AL
 UTR14 are affected. For alphabetic style line breaking, breaks
 UTR14 for these four cases require space, for ideographic style
 UTR14 line breaking, these four cases don't require spaces.

 where style 1 and style 2 are defined as

 UTR14 1. Western (spaces and hyphens are used to determine breaks)
 UTR14 2. East Asian (lines can break anywhere, unless prohibited)


 Let me make it clear that virtually NO books published in Korean use the
 space-based (style 1) line breaking rule. The style 2 line breaking rule
 is *exclusively* used for modern Korean text, no matter what some broken
 word processors for Korean offer as an alternative to style 2 and what
 some web browsers (e.g. Netscape 4.x; Mozilla fixed this problem) do.

 I'm very alarmed to find this 'misinformation' crept into the TUS and
 UTR14 (now UAX #14). It would be nice if somebody in charge could get
 this straightened out.

 Regards,

 Jungshik Shin





Re: Korean linebreking and UTR14(was Re: extracting words)

2001-02-12 Thread Jungshik Shin



On Mon, 12 Feb 2001, Mark Davis wrote:

Thank you for your answer.

 Asmus Freytag is the one to talk to; he can look into this.

Do you think I should contact him directly off-line? I thought he's on
this list now as well as  back in March 2000 when I wrote about TUS 3.0
p. 124.

  On Mon, 12  Feb 2001, "Jungshik Shin" [EMAIL PROTECTED] wrote:
  On Sun, 11 Feb 2001, Mark Davis wrote:
 
  MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
  MD recommended in my last message. The Unicode standard is online, as is

  As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
  that leads to the following in the TR on line breaking (and what's written
  about it in Chap 5 of TUS 3.0) came from.
 
  UTR14   Korean may alternately use a space-based (style 1) instead of the
  UTR14   style 2 context analysis.

BTW, this clearly shows that what Rick McGowan wrote about 'either ... or'
in response to what I wrote about Korean line breaking rule (TUS 3.0
p. 124) in March 2000 is not right, as I argued then.  I'm sure he's
right about 'either ... or ' in English grammar but the intention of the
author is on my side if the author of UTR 14 is the same as that of the
part  in question in TUS 3.0. I'm enclosing at the end of this message
a part of my message in response to him.


  I'm very alarmed to find this 'misinformation' crept into the TUS and
  UTR14 (now UAX #14). It would be nice if somebody in charge could get
  this straightened out.

This wasn't fixed in Unicode 3.1, either. What would be the best way
to get it addressed before the next revision comes out? I'm afraid just
raising it on this list wouldn't be sufficient (of course, I should
have followed up more vigorously last year).

Regards,

Jungshik Shin


Enc.

1. Two messages of mine
   the first one : March 1, 2000
   the second one: March 2, 2000

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

On Sun, 13 Feb 2000, Kenneth Whistler wrote:

 Lest anyone feel unduly constrained, let me note that now that
 the editorial committee has closed the book, so to speak, on Unicode 3.0,
 all of you who are about to open the book for the first time should
 feel free to unleash your commentary on the text.

   I've just received my copy of Unicode 3.0 book, here goes
my first commentary.

   On page 124 (section 5.15, Locating Text Element Boundaries),
the third paragraph has the following around the end:

U3.0 In particular, word, line, and sentence boundaries will need to
U3.0 be customized according to locale and user preference. In Korean,
U3.0 for example, lines may be broken either at spaces(as in Latin text) or
U3.0 on ideographic boundaries (as in Chinese).

  First of all, it's a great mystery to me how on earth this
strange notion of Korean having *two* different line breaking rules(as
opposed to one)  crept into the expertise of non-Korean experts on Korean
and finally made it into Unicode 3.0 book and Unicode TR on line breaking.

  None of tens of Korean books on my bookshelves
I've just gone through breaks lines *exclusively* at spaces. All of them
break lines freely at *syllables*. Only places where lines are broken
*exclusively* at spaces(for Korean text)  I can think of are completely
*broken*(as far as Korean line breaking is concerned) web browsers like
Netscape and MS IE and possibly earlier implementations of Korean LaTeX.
One may add  to the list Korean text formatted by non-localized version
of 'fmt' (in Unix) as another example. To work around the problem caused
by these broken web browsers, some Korean web authors apply a simple
filter to insert <wbr> between every pair of Korean syllables in their
HTML files. To see what I mean, you may want to take a look at
http://photon.hgs.yale.edu/~jungshik/lb.html and
http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg
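
A filter of the kind described above might look like the following Python
sketch. It is a guess at the general idea, not the script any particular
web author used, and it assumes that "Korean syllable" means a precomposed
Hangul syllable (U+AC00..U+D7A3) and that the input is plain text without
existing markup.

```python
import re

# Assumption: "Korean syllable" means a precomposed Hangul syllable,
# U+AC00..U+D7A3; the input is plain text with no existing markup.
# The pattern matches the empty position between two adjacent syllables.
_BETWEEN_SYLLABLES = re.compile("(?<=[\uac00-\ud7a3])(?=[\uac00-\ud7a3])")


def insert_wbr(text):
    """Insert <wbr> between every pair of adjacent Hangul syllables."""
    return _BETWEEN_SYLLABLES.sub("<wbr>", text)


if __name__ == "__main__":
    print(insert_wbr("방에 있어요"))
```

Spaces and non-Hangul text are left untouched, so the browser's own
space-based breaking keeps working for everything else.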

  Let me emphasize that lines can be broken at any syllable boundary
in Korean text (except for some obvious exceptions as applied in English
text: i.e. punctuation marks like '!', '?' cannot begin a line).

  Secondly, even in Latin scripts(well, at least in English) lines can
be broken not only at spaces but also at syllables(syllabic boundaries)
with a hyphen.  The only difference between Korean line breaking and English
line breaking is that Korean doesn't need a hyphen when lines are broken at
syllables, because in Korean syllables form another visual unit a level
higher than alphabetic/phonetic letters (consonants and vowels).

  Thirdly, the expression 'ideographic boundaries' is not appropriate;
'syllabic boundaries' or 'syllables' would be more accurate.

  Given these, I'd like to  suggest the last sentence(that begins with
'In Korean, for example...') be removed in the future edition because
Korean is NOT a good example case where there can be multiple line
breaking rules depending on user preference.

Jungshik Shin

From: Jungshik Shin [EMAIL PROTECTED]
Subject: RE: Korean 

Re: extracting words

2001-02-12 Thread Jungshik Shin




On Sun, 11 Feb 2001, Mark Davis wrote:

 BTW, someone on this thread made this topic out to be even more complex than
 it is: that Devanagari and Korean are written without spaces. While that may
 have been the case historically, I believe that the modern text does use
 spaces. Chinese, Japanese and Thai are the main languages written without
 spaces.

As I wrote earlier and you correctly believe, spaces are used to separate
words in Korean text. That has been the case at least since the Korean
Linguistic Society - KLS: Hangul Hakhoe - published the unified rules of
Korean orthography in 1933. This practice of using spaces must have been
predominant well before that because otherwise the Korean Linguistic
Society might not have come up with that. The orthographic standards
of both North and South Korea agree on this point.  More details are
available at http://www.hangeul.or.kr in Korean only. The full text
of various standards at the site - four orthographic standards (KLS :
1933, 1980, North Korea: 1987, South Korea MOE: 1988), transliteration of
foreign words in Hangul (South Korea MOE, 1985), transcription of Korean in
Roman alphabets - are only available in HWP - one of the most popular word
processors in Korea -  format which can be viewed with Namo HWP viewer
for MS-Windows at http://www.namo.co.kr/download/dwn_hwpv.html. People
in the US may find that the bottom of each page gets cropped if printed
directly from Namo HWP viewer as they're made for A4 paper. A workaround
is to print to a file (using a PS printer driver) and use Ghostscript to
print (using PDFWriter may do the same trick). If interested, drop me
a line off-line and I'll send a copy either in PDF or PS (resized to
better fit US letter paper if necessary)

Jungshik Shin




Re: extracting words

2001-02-11 Thread Jungshik Shin




On Sat, 10 Feb 2001, Edward Cherlin wrote:

 At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:

 I'm writing a C program called Blacklist. Its purpose is to accept
 a string (Unicode) and extract words from it, then hash the found words
 according to a hashing algorithm and see if the word is in a blacklist
 hash table.
 
 This is all very straightforward, but the problem is extracting
 words from this string.
 How do I determine what a word is in Japanese or Korean or whatever other
 language? { a space ? }

 No. Chinese and Japanese almost never have spaces between words, and
 they are not required in Korean.

I'm afraid this is a little bit misleading.  In modern Korean orthography,
every word is delimited by space (Korean Orthographic Rules, article 2 :
1988-01-19, Ministry of Education, ROK). The exception for that rule is
that particles (Josa) have to follow the preceding word without space
(ibid, article 41). There are also some minor exceptions (ibid, article
43, article 47, article 49 )  so you might say you're correct in that
spaces are *not required* in Korean, but the principle of delimiting
every pair of words with a space is still there.


 Yes, we have had it for a long time; no, nobody has solved it
 entirely; and yes, this approach is wrong. Breaking a string into
 words may require a thorough understanding of the vocabulary and
 grammar of the language, and even that may not be enough.

I absolutely agree with you on this point.

 An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
 Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange
 isseoyo (Father is in the bag)?

I don't think this is such a good example for your case for the enormous
difficulties and complexities involved in extracting words (which
I agree on) because the original question was how to extract words
out of 'supposedly orthographically correct sentences'. Your example
(Abeojigabangeisseoyo) clearly violates the (modern) orthographic rule
by glueing together all the words without space (nobody would write
that way).  One of the reasons that spaces are used to separate words
in Korean writing is to break/lift this kind of degeneracy (as taught
in the first grade Korean class). It would have been more appropriate
if you had come up with an example from Japanese or Chinese where spaces
are rarely used to separate words.

Jungshik Shin




Re: extracting words

2001-02-11 Thread Jonathan Lewis

 in the first grade Korean class). It would have been more appropriate
 if you had come up with an example from Japanese or Chinese where spaces
 are rarely used to separate words.

From Japanese, how about:

kokodehakimonowonuidekudasai

This could be

koko de hakimono wo nuide kudasai (take your shoes off here)

or

koko deha kimono wo nuide kudasai (take your clothes off here; in this case
"ha" is pronounced "wa")


Actually, this example was given at an AAMT meeting last year to show that
writing words in kanji + kana rather than just kana makes it easier for MT
software to break down strings into words accurately and therefore produce
better translations.
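
The ambiguity is easy to reproduce mechanically. The Python sketch below
enumerates every way of splitting the romanized string into words from a
word list; LEXICON is a hypothetical toy list covering only this example,
and a real system would need a full dictionary plus some way of ranking the
candidates.

```python
# LEXICON is a hypothetical toy word list covering only this example.
LEXICON = {"koko", "de", "deha", "hakimono", "kimono", "wo", "nuide", "kudasai"}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split text into a sequence of lexicon words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Try every segmentation of the remainder after this word.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("kokodehakimonowonuidekudasai"):
        print(" ".join(words))
```

Both readings come out as equally valid segmentations, so a dictionary
alone cannot pick the intended one.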

Best wishes,

Jonathan Lewis

Tokyo Denki University




Re: extracting words

2001-02-11 Thread Mark Davis

Word break is *very* different than linebreak; see Chapter 5 of TUS, and the
Linebreak TR. For linebreak the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply linebreak mechanisms as a part of the standard API. They also
supply wordbreak, but it is recognized that those are purely heuristic for
languages such as Chinese and Japanese; the APIs are intended for functions
like double-click, not for dividing text into terms for searching. The
latter is a very complex problem, since doing it well requires both division
into words, and extraction of roots: e.g. "go" from "went" and "gone". It is
important to keep these very different processes straight:

- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)

BTW, someone on this thread made this topic out to be even more complex than
it is: that Devanagari and Korean are written without spaces. While that may
have been the case historically, I believe that the modern text does use
spaces. Chinese, Japanese and Thai are the main languages written without
spaces.

Mark
- Original Message -
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 09:47
Subject: FW: extracting words



 Yes, we have had it for a long time; no, nobody has solved it
 entirely; and yes, this approach is wrong. Breaking a string into
 words may require a thorough understanding of the vocabulary and
 grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be
that, say, for a simple editor (be it written in Java or whatever) you have
to supply a database with language specific details just to do automatic
word wrap.

Ciao, Mike






Re: extracting words

2001-02-11 Thread Mark Davis

Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to www.unicode.org, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
in great detail.

However, as with all Unicode functionality, you should not try to reinvent
the wheel. See if you can use the services of the OS/Platform, or get a
Unicode library (such as ICU or Basis), to cover your requirements. You
mentioned Java; it has an API for line break.

Mark

P.S. It also helps communication if we use the same terms, e.g. "line
break", not "word wrapping".
P.P.S. As to the list settings: if we change it to please you, we would
annoy someone else. We cannot simultaneously please everyone. And please,
nobody start another thread on this topic.

- Original Message -
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words




 - line break (wrapping lines on the screen)
 - word break (for selection)
 - word/root extraction (for search)

I recognize that the second and third case are really difficult to handle.
But for word wrapping I assume line breaking is sufficient. But when I don't
have spaces to use for wrapping and/or don't know whether the actual text
part uses spaces at all (what about exotic languages like Ogham or
Anglo-saxon?) then how can I go to implement word wrapping? Simply do it
character by character?

Ciao, Mike

PS: sorry for sending this mail first to you privately, but those
impractical list settings always make me send to the wrong place first.
It is difficult for me to get used to these strange settings. I'm answering
about 50 mails per day with a simple "reply", so I simply forget all the
time that I have to "reply all" (and the out-of-office bounces I get to my
private mail whenever I send a message to the Unicode list don't make the
task easier).





Re: extracting words

2001-02-10 Thread Edward Cherlin

At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
Hello all,

I'm writing a C program called Blacklist. Its purpose is to accept
a string (Unicode) and extract words from it, then hash the found words
according to a hashing algorithm and see if the word is in a blacklist
hash table.

This is all very straightforward, but the problem is extracting
words from this string.
How do I determine what a word is in Japanese or Korean or whatever other
language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and 
they are not required in Korean.

In Devanagari and related scripts a consonant at the end of a word 
can join with a vowel at the beginning of the next word in a single 
symbol, so you can't just divide the string into segments. There are 
other complications in other writing systems.

The problem is not trivial in Latin alphabet writing, either. 
Hyphenated expressions can be quasi-unified words where one or more 
components is not a separate word, or ad-hoc, even one-time-only 
phrases. The definition of words in a language is also changing. 
"Cannot" is currently one word, but used to be two. "An adder" used 
to be "a nadder".

I think somebody must have had this problem and solved it, or maybe my
approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it 
entirely; and yes, this approach is wrong. Breaking a string into 
words may require a thorough understanding of the vocabulary and 
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange 
isseoyo (Father is in the bag)?

I hope somebody can give me some good pointers, directions or suggestions.

Thanks for your time,


Brahim Mouhdi

{42.}

-- 

Edward Cherlin, Spamfighter http://www.cauce.org
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit



RE: extracting words

2001-02-10 Thread Makarand Gadre

As Edward said, getting words from a string is nontrivial. You get similar
issues in Thai. Thai does not have any space between words, but the script
is Indic based (phonetic). You have to continuously look up the speller and
even then it can't be correct for all cases. E.g.

"Sunday" or "therapist" could each be interpreted as two words (such as
"sun" and "day") while the user meant the single word. In Sanskrit, you can
create new words by doing a "sandhi" or conjunction.


Makarand


-Original Message-
From: Edward Cherlin [mailto:[EMAIL PROTECTED]] 
Sent: Sunday, 11 February, 2001 05:34
To: Unicode List
Subject: Re: extracting words


At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
Hello all,

I'm writing a C program called Blacklist. Its purpose is to
accept a string (Unicode) and extract words from it, then hash the
found words according to a hashing algorithm and see if the word is in
the blacklist hashtable.
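
For reference, the pipeline described above might look roughly like this. It is a sketch only, written in Python rather than C for brevity; SHA-256 stands in for the unspecified hashing algorithm, and the names (word_hash, BLACKLIST_HASHES, flagged_words) and sample entries are invented for illustration. It splits on whitespace, which is exactly the step that breaks down for text written without spaces:

```python
import hashlib


def word_hash(word):
    """Hash one word; SHA-256 is an assumed stand-in for the real algorithm."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


# Hypothetical blacklist, stored as a set of hashes (a hash table lookup).
BLACKLIST_HASHES = {word_hash(w) for w in ("spam", "scam")}


def flagged_words(text):
    """Split on whitespace and return the words whose hash is blacklisted."""
    return [w for w in text.split() if word_hash(w) in BLACKLIST_HASHES]


print(flagged_words("this is spam not ham"))  # finds the blacklisted word
print(flagged_words("thisisspamnotham"))      # no spaces, so nothing is found
```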

This is all very straightforward, but the problem is the extracting of
words from this string. How do I determine what a word is in Japanese or
Korean or whatever other language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and 
they are not required in Korean.

In Devanagari and related scripts a consonant at the end of a word 
can join with a vowel at the beginning of the next word in a single 
symbol, so you can't just divide the string into segments. There are 
other complications in other writing systems.

The problem is not trivial in Latin alphabet writing, either. 
Hyphenated expressions can be quasi-unified words where one or more 
components is not a separate word, or ad-hoc, even one-time-only 
phrases. The definition of words in a language is also changing. 
"Cannot" is currently one word, but used to be two. "An adder" used 
to be "a nadder".

I think somebody must have had this problem and solved it, or maybe my 
approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it 
entirely; and yes, this approach is wrong. Breaking a string into 
words may require a thorough understanding of the vocabulary and 
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange 
isseoyo (Father is in the bag)?

I hope somebody can give me some good pointers, directions or 
suggestions.

Thanks for your time,


Brahim Mouhdi

{42.}

-- 

Edward Cherlin, Spamfighter http://www.cauce.org
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit



RE: extracting words

2001-01-29 Thread Christopher John Fynn

You might have to apply different rules depending on the script. In Indic scripts 
there are often no explicit word boundary markers and you may have to look for 
grammatical particles. In Tibetan, a string of letters and vowels between two tsheg 
[U+0F0B / U+0F0C] characters (or other "punctuation") is a morpheme (not that different 
from a word) - but there are many complex words consisting of two or more such 
morphemes. (I don't know any CJK languages but I suspect that most individual 
characters in that block are morphemes as well.)
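
As a small illustration of the Tibetan case, here is a sketch that splits a string on the two tsheg characters. It assumes only U+0F0B and U+0F0C act as delimiters (other punctuation is ignored here), and tsheg_units is a made-up helper name. The sample string is written with escapes and is intended to spell "bod skad":

```python
import re

# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
# U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
TSHEG = re.compile("[\u0F0B\u0F0C]")


def tsheg_units(text):
    """Split text at tsheg characters and drop empty pieces."""
    return [unit for unit in TSHEG.split(text) if unit]


# Two tsheg-delimited units, each followed by U+0F0B.
sample = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51\u0F0B"
print(tsheg_units(sample))
```

The units this returns are morpheme-sized, so grouping them into multi-morpheme words would still need a dictionary.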

BTW without determining the language as well as the script, how do you propose to 
determine if a particular string actually matches a word in your "blacklist" (in terms 
of meaning) or not? The same string of characters might mean completely different 
things in two languages that share the same script (/Unicode block).

- Chris

 -Original Message-
 From: Brahim Mouhdi [mailto:[EMAIL PROTECTED]]
 Sent: Monday, January 29, 2001 1:03 AM
 To: Unicode List
 Subject: extracting words
 
 
 
 Hello all,
 
 I'm writing a C program called Blacklist. Its purpose is to accept
 a string (Unicode) and extract words from it, then hash the found words
 according to a hashing algorithm and see if the word is in the blacklist
 hashtable.
 
 This is all very straightforward, but the problem is the extracting of
 words from this string.
 How do I determine what a word is in Japanese or Korean or whatever other
 language? { a space ? }
 I think somebody must have had this problem and solved it, or maybe my
 approach to the problem is wrong.
 
 I hope somebody can give me some good pointers, directions or suggestions.
 
 Thanks for your time,
 
 
 Brahim Mouhdi
 
 {42.}




Re: extracting words

2001-01-29 Thread Lukas Pietsch


Christopher Fynn wrote:

BTW without determining the language as well as the script, how do you
propose to determine if a particular string actually matches a word in
your "blacklist" (in terms of meaning) or not? The same string of
characters might mean completely different things in two languages that
share the same script (/Unicode block).

This is assuming that what we want is not just a matching of
*orthographical* words (character strings), but of *lexicographical* words
(aka lexemes). Which of course brings with it even more problems. If you
want to filter out all occurrences of, say, a particular verb, you'll have
to look out for all possible grammatical forms of that verb. 5 forms at
maximum in English (go, goes, went, gone, going), but maybe several
hundreds in a heavily inflectional or agglutinative language. In some
languages the set of possible forms of a lexeme may even be open-ended. No
way of doing that without a full-blown morphological parser (which of
course would have to be language-specific.) Looks like this goes a bit
beyond what Brahim is planning to do.
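
For a language with few forms per lexeme, the matching can be faked with a lookup table. This is a sketch under that assumption only; the FORMS table and blocked_lexeme are hypothetical, and a table like this is exactly what stops working once the set of forms runs to hundreds or is open-ended:

```python
# Hypothetical table: each blocked lexeme with its listed surface forms.
FORMS = {
    "go": {"go", "goes", "went", "gone", "going"},
}

# Invert the table so each surface form points back at its lexeme.
FORM_TO_LEXEME = {
    form: lexeme for lexeme, forms in FORMS.items() for form in forms
}


def blocked_lexeme(word):
    """Return the blocked lexeme that word is a form of, or None."""
    return FORM_TO_LEXEME.get(word.lower())


print(blocked_lexeme("Went"))
print(blocked_lexeme("goat"))
```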

Lukas





Re: extracting words

2001-01-29 Thread John Cowan

Lukas Pietsch wrote:


 This is assuming that what we want is not just a matching of
 *orthographical* words (character strings), but of *lexicographical* words
 (aka lexemes). 

But it is impossible in fully cross-linguistic situations in general.
There is simply nothing to do about the fact that "such" is a very
common word, perfectly harmless, in the English language; whereas
in the Nootka language (an Amerindian language of the Pacific
Northwest) it is a vulgarism for the external female genitalia.
A properly multilingual vulgarism-remover would have to
determine whether the document was English or Nootka before
deciding whether to block "such".
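
In code terms, the blacklist has to be keyed by language as well as by string. A minimal sketch, assuming the language of the document is already known by some other means (the language labels, BLACKLISTS and is_blocked are invented for illustration):

```python
# Hypothetical per-language blacklists; the keys are just labels.
BLACKLISTS = {
    "english": set(),
    "nootka": {"such"},
}


def is_blocked(word, language):
    """Check word against the blacklist for the given language only."""
    return word.lower() in BLACKLISTS.get(language, set())


print(is_blocked("such", "english"))
print(is_blocked("such", "nootka"))
```

The hard part, identifying the language in the first place, is outside this sketch.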

-- 
There is / one art || John Cowan [EMAIL PROTECTED]
no more / no less  || http://www.reutershealth.com
to do / all things || http://www.ccil.org/~cowan
with art- / lessness   \\ -- Piet Hein