Re: Korean linebreaking and UTR14 (was Re: extracting words)
If I wanted to get anyone's attention, I would send them a direct message. Many people on the list, myself included, get swamped at times and don't necessarily look at every message.

Mark

----- Original Message -----
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 20:30
Subject: Re: Korean linebreaking and UTR14 (was Re: extracting words)
RE: extracting words
Mark Davis wrote:

> BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese and Thai are the main languages written without spaces.

Several Indic languages/scripts do not use spaces (or other marker characters) between words or syllables. I don't think you can even rely on spaces between words for all the different Indic languages that use only the Devanagari script.

Tibetan script has a "syllable" (or morpheme) separator [U+0F0B] which provides a line break opportunity - but in modern Dzongkha (Bhutanese) this character is dropped in many places where a reader can determine the boundary by grammatical rules.

BTW, in traditional Tibetan orthography a space is *not* a line break opportunity.

- Chris

--
Chris Fynn
DDC Dzongkha Computing Project
Thimphu, Bhutan.
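To make the Tibetan point concrete, here is a minimal sketch of a break-opportunity finder that treats the separator U+0F0B as the only place a line may end and ignores spaces altogether. The function name `tsheg_break_offsets` is made up for illustration, and real Tibetan line breaking has more rules than this; it only restates the two remarks above in code.

```python
TSHEG = "\u0F0B"  # the "syllable" (morpheme) separator mentioned above


def tsheg_break_offsets(text: str) -> list[int]:
    """Return offsets where a new line may start: just after each U+0F0B.

    Spaces are deliberately ignored, following the remark that a space is
    not a line break opportunity in traditional Tibetan orthography.
    A separator at the very end of the text yields no offset.
    """
    return [
        i + 1
        for i, ch in enumerate(text)
        if ch == TSHEG and i + 1 < len(text)
    ]


if __name__ == "__main__":
    # Placeholder Latin letters stand in for Tibetan letters here.
    sample = "ka" + TSHEG + "kha ga"
    print(tsheg_break_offsets(sample))
```

A text in modern Dzongkha that drops the separator in many places would yield few or no offsets here, which is exactly the case where grammatical knowledge is needed instead.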
RE: extracting words
> BTW In traditional Tibetan orthography, a space is *not* a line break opportunity.

What's the role of a space in there, then?

> - Chris
>
> --
> Chris Fynn
> DDC Dzongkha Computing Project
> Thimphu, Bhutan.
RE: extracting words
> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)
>
> I recognize that the second and third case are really difficult to handle.

Root extraction is decidedly non-trivial and a highly language-specific problem, even more so than word breaking: it's a messy linguistic problem instead of a clean algorithmic problem. To start with, the choice of the term "extraction" shows that one has not understood the problem in all its g(l)ory: a more appropriate term would be "finding", or maybe "reducing", the root.

Also, I would add

- "syllablization" (is that a word?)

as a third problem (for breaking words more nicely into lines); it would rank in difficulty somewhere between word breaking and root extraction.

> But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping and/or don't know whether the actual text part uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?) then how can I go to implement word wrapping? Simply do it character by character?
RE: extracting words
> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)
>
> I recognize that the second and third case are really difficult to handle.

Jarkko Root extraction is decidedly non-trivial and a highly
Jarkko language-specific problem, even more so than word breaking, it's a
Jarkko messy linguistic problem instead of a clean algorithmic problem.
Jarkko To start with, the choice of the term "extraction" shows that one
Jarkko has not understood the problem in all its g(l)ory: a more
Jarkko appropriate term would be "finding", or maybe, "reducing" the
Jarkko root.

The words we use in computational linguistics are "stemming" and, less frequently, "lemmatization." This is often the step in morphological analysis that precedes determining the part-of-speech. Jarkko is right that it is a messy problem for many languages.

Jarkko - "syllablization" (is that a word?) as a third problem (for
Jarkko breaking words more nicely into lines), it would rank in
Jarkko difficulty somewhere between word breaking and root extraction.

I believe "syllabization" or perhaps "syllabification" might be the term.

> But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping and/or don't know whether the actual text part uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?) then how can I go to implement word wrapping? Simply do it character by character?

Spaces and other punctuation come in handy for line breaking. Segmentation is used with scripts that don't use this sort of intra-sentence term separation (e.g. Chinese, Japanese, Thai). There are whole conferences devoted to segmentation approaches. Another messy area of computational linguistics :-)

If segmentation is not available, then lines are often wrapped between characters.

--
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"But there is no doubt but money is to the fore now. It is the romance, the poetry of our age. It's the thing that chiefly strikes our imagination."
-- The Rise of Silas Lapham, W. D. Howells
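As a rough illustration of the last point (wrapping between characters when no segmentation is available), here is a minimal sketch. The function name `wrap_by_character` is invented for this example, and the one refinement it makes, keeping combining marks with their base character, is an assumption of mine rather than something stated in the thread. It counts code points, not display width.

```python
import unicodedata


def wrap_by_character(text: str, width: int) -> list[str]:
    """Wrap text between characters, with no knowledge of words.

    A line is closed once it holds at least `width` code points, except
    that a combining mark is never moved to the start of a new line, so a
    line can exceed `width` by the marks attached to its last character.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    lines: list[str] = []
    current = ""
    for ch in text:
        # unicodedata.combining() returns 0 for non-combining characters.
        if len(current) >= width and not unicodedata.combining(ch):
            lines.append(current)
            current = ""
        current += ch
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in wrap_by_character("kokodehakimonowonuidekudasai", 10):
        print(line)
```

This is the fallback only; where spaces or a segmenter are available, breaking at those boundaries gives far better results.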
[OT?] Re: extracting words
In a message dated 2001-02-12 8:54:10 Pacific Standard Time, [EMAIL PROTECTED] writes:

> Also, I would add - "syllablization" (is that a word?) as a third problem (for breaking words more nicely into lines), it would rank in difficulty somewhere between word breaking and root extraction.

I think the canonical word is "syllabification," but from a word-inventing perspective, I agree with Jarkko's first instinct. The suffix "-ize" seems more appropriate to the process being discussed than "-fy".

-Doug Ewell
Fullerton, California
Re: extracting words
From: "Kenneth Whistler" [EMAIL PROTECTED]

> the tsek (U+0F0B) that roughly occurs between syllables. Yes, Tibetanists, I know that the term "syllable" is not technically correct here, so please don't nitpick me to death on this one. ;-)

Ironically enough, there are a number of native speakers who struggle with the fact that "syllable" is apparently the best available word for them, if all of the usual connotations could be dispensed with.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Korean linebreaking and UTR14 (was Re: extracting words)
On Sun, 11 Feb 2001, Mark Davis wrote:

MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in particular discusses the recommended approach to line break
MD in great detail.

As I wrote when TUS 3.0 came out, I cannot help wondering where the idea that leads to the following in the TR on line breaking (and what's written about it in Chap. 5 of TUS 3.0) came from.

UTR14 Korean may alternately use a space-based (style 1) instead of the
UTR14 style 2 context analysis.

UTR14 1. Korean uses either implicit breaking around
UTR14 Hangul and ideographs or uses spaces. Reference [1] shows
UTR14 how this can be elegantly handled by the second or third
UTR14 method. Only the intersection of ID/ID, AL/ID and ID/AL
UTR14 are affected. For alphabetic style line breaking, breaks
UTR14 for these four cases require space, for ideographic style
UTR14 line breaking, these four cases don't require spaces.

where style 1 and style 2 are defined as

UTR14 1. Western (spaces and hyphens are used to determine breaks)
UTR14 2. East Asian (lines can break anywhere, unless prohibited)

Let me make it clear that virtually NO books published in Korean use the space-based (style 1) line breaking rule. The style 2 line breaking rule is *exclusively* used for modern Korean text, no matter what some broken word processors for Korean offer as an alternative to style 2 and what some web browsers do (e.g. Netscape 4.x; Mozilla fixed this problem).

I'm very alarmed to find that this 'misinformation' crept into TUS and UTR14 (now UAX #14). It would be nice if somebody in charge could get this straightened out.

Regards,

Jungshik Shin
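To show what is at stake between the two styles quoted above, here is a minimal sketch that lists break opportunities in a short Korean string under each style. It is not the UTR14 algorithm: the helper `is_hangul_or_ideograph` is a rough stand-in for the ID class with illustrative code point ranges that I picked, hyphens are left out, and the "unless prohibited" part of style 2 (for example, punctuation that must not begin a line) is not modelled.

```python
def is_hangul_or_ideograph(ch: str) -> bool:
    """Rough stand-in for line break class ID; ranges are illustrative only."""
    cp = ord(ch)
    return 0xAC00 <= cp <= 0xD7A3 or 0x4E00 <= cp <= 0x9FFF


def break_offsets(text: str, style: int) -> list[int]:
    """Return offsets at which a new line may start.

    style 1: only after a space.
    style 2: after a space, and also between two adjacent Hangul
             syllables or ideographs.
    """
    offsets = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " ":
            continue  # a line never starts with the space itself
        if prev == " ":
            offsets.append(i)
        elif style == 2 and is_hangul_or_ideograph(prev) and is_hangul_or_ideograph(cur):
            offsets.append(i)
    return offsets


if __name__ == "__main__":
    sample = "\ud55c\uad6d\uc5b4 \ubb38\uc7a5"  # two Korean words separated by a space
    print("style 1:", break_offsets(sample, 1))
    print("style 2:", break_offsets(sample, 2))
```

Under style 1 the sample can only break at the single space, while style 2 allows a break before every syllable, which is the behaviour described above for printed Korean books.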
Re: Korean linebreaking and UTR14 (was Re: extracting words)
Asmus Freytag is the one to talk to; he can look into this.

Mark

----- Original Message -----
From: "Jungshik Shin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 13:33
Subject: Korean linebreaking and UTR14 (was Re: extracting words)
Re: Korean linebreaking and UTR14 (was Re: extracting words)
On Mon, 12 Feb 2001, Mark Davis wrote:

Thank you for your answer.

> Asmus Freytag is the one to talk to; he can look into this.

Do you think I should contact him directly off-line? I thought he was on this list now, as well as back in March 2000 when I wrote about TUS 3.0 p. 124.

On Mon, 12 Feb 2001, "Jungshik Shin" [EMAIL PROTECTED] wrote:

> On Sun, 11 Feb 2001, Mark Davis wrote:
>
> MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
> MD recommended in my last message. The Unicode standard is online, as is
>
> As I wrote when TUS 3.0 came out, I cannot help wondering where the idea that leads to the following in the TR on line breaking (and what's written about it in Chap. 5 of TUS 3.0) came from.
>
> UTR14 Korean may alternately use a space-based (style 1) instead of the
> UTR14 style 2 context analysis.

BTW, this clearly shows that what Rick McGowan wrote about 'either ... or' in response to what I wrote about the Korean line breaking rule (TUS 3.0 p. 124) in March 2000 is not right, as I argued then. I'm sure he's right about 'either ... or' in English grammar, but the intention of the author is on my side if the author of UTR 14 is the same as that of the part in question in TUS 3.0. I'm enclosing at the end of this message a part of my message in response to him.

> I'm very alarmed to find this 'misinformation' crept into the UTS and UTR14 (now UAX #14). It would be nice if somebody in charge could get this straightened.

This didn't make it into Unicode 3.1, either. What would be the best way to get it addressed before the next revision comes out? I'm afraid just raising it on this list wouldn't be sufficient (of course, I should have followed up more vigorously last year).

Regards,

Jungshik Shin

Enc.
1. Two messages of mine
   the first one: March 1, 2000
   the second one: March 2, 2000

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

On Sun, 13 Feb 2000, Kenneth Whistler wrote:

> Lest anyone feel unduly constrained, let me note that now that the editorial committee has closed the book, so to speak, on Unicode 3.0, all of you who are about to open the book for the first time should feel free to unleash your commentary on the text.

I've just received my copy of the Unicode 3.0 book, so here goes my first commentary.

On page 124 (section 5.15 Locating Text Element Boundaries), the third paragraph has the following around the end:

U3.0 In particular, word, line, and sentence boundaries will need to
U3.0 be customized according to locale and user preference. In Korean,
U3.0 for example, lines may be broken either at spaces (as in Latin text) or
U3.0 on ideographic boundaries (as in Chinese).

First of all, it's a great mystery to me how on earth this strange notion of Korean having *two* different line breaking rules (as opposed to one) crept into the expertise of non-Korean experts on Korean and finally made it into the Unicode 3.0 book and the Unicode TR on line breaking. None of the tens of Korean books on my bookshelves that I've just gone through breaks lines *exclusively* at spaces. All of them break lines freely at *syllables*. The only places I can think of where lines are broken *exclusively* at spaces (for Korean text) are web browsers like Netscape and MS IE, which are completely *broken* as far as Korean line breaking is concerned, and possibly earlier implementations of Korean LaTeX. One may add to the list Korean text formatted by a non-localized version of 'fmt' (in Unix) as another example. To work around the problem caused by these broken web browsers, some Korean web authors apply a simple filter to their HTML files that inserts <wbr> between every pair of Korean syllables.
To see what I mean, you may want to take a look at http://photon.hgs.yale.edu/~jungshik/lb.html and http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg

Let me emphasize that lines can be broken at any syllable boundary in Korean text (except for some obvious exceptions that also apply in English text: e.g. punctuation marks like '!' and '?' cannot begin a line).

Secondly, even in Latin scripts (well, at least in English) lines can be broken not only at spaces but also at syllables (syllabic boundaries) with a hyphen. The only difference between Korean line breaking and English line breaking is that Korean doesn't need a hyphen when lines are broken at syllables, because in Korean syllables form another visual unit a level higher than alphabetic/phonetic letters (consonants and vowels).

Thirdly, the expression 'ideographic boundaries' is not appropriate; 'syllabic boundaries' or 'syllables' would be.

Given these, I'd like to suggest that the last sentence (the one that begins with 'In Korean, for example...') be removed in a future edition, because Korean is NOT a good example of a case where there can be multiple line breaking rules depending on user preference.

Jungshik Shin

From: Jungshik Shin [EMAIL PROTECTED]
Subject: RE: Korean
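The workaround mentioned in the enclosed message (a filter that puts a break opportunity between every pair of Korean syllables) could look roughly like the sketch below. This is not the filter those web authors actually used, which is not shown in the thread. The name `insert_wbr` is made up, the pattern only covers precomposed Hangul syllables (U+AC00 to U+D7A3), and the function assumes it is given plain text rather than markup, so it does not protect Hangul inside tags or attribute values.

```python
import re

# Zero-width position between two precomposed Hangul syllables.
_BETWEEN_SYLLABLES = re.compile(r"(?<=[\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])")


def insert_wbr(text: str) -> str:
    """Insert <wbr> between every pair of adjacent Hangul syllables."""
    return _BETWEEN_SYLLABLES.sub("<wbr>", text)


if __name__ == "__main__":
    print(insert_wbr("\ud55c\uad6d\uc5b4 text"))
```

Existing spaces are left alone, since they already give the browser a break opportunity.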
Re: extracting words
On Sun, 11 Feb 2001, Mark Davis wrote:

> BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese and Thai are the main languages written without spaces.

As I wrote earlier and you correctly believe, spaces are used to separate words in Korean text. That has been the case at least since the Korean Linguistic Society (KLS: Hangul Hakhoe) published the unified rules of Korean orthography in 1933. This practice of using spaces must have been predominant well before that, because otherwise the Korean Linguistic Society might not have come up with that rule. The orthographic standards of both North and South Korea agree on this point.

More details are available at http://www.hangeul.or.kr in Korean only. The full text of the various standards at the site - four orthographic standards (KLS: 1933, 1980; North Korea: 1987; South Korea MOE: 1988), transliteration of foreign words in Hangul (South Korea MOE, 1985), and transcription of Korean in the Roman alphabet - is only available in the format of HWP, one of the most popular word processors in Korea. These files can be viewed with Namo HWP viewer for MS-Windows, available at http://www.namo.co.kr/download/dwn_hwpv.html

People in the US may find that the bottom of each page gets cropped if printed directly from Namo HWP viewer, as the documents are made for A4 paper. A way around this is to print to a file (using a PS printer driver) and use ghostscript to print (using PDFWriter may do the same trick). If interested, drop me a line off-line and I'll send a copy either in PDF or PS (resized to better fit US letter paper if necessary).

Jungshik Shin
Re: extracting words
On Sat, 10 Feb 2001, Edward Cherlin wrote:

> At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
>
> > I'm writing a C program called Blacklist. Its purpose is to accept a string (Unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the word is in the blacklist hashtable. This is all very straightforward, but the problem is the extracting of words from this string. How do I determine what a word is in Japanese or Korean or whatever other language? { a space ? }
>
> No. Chinese and Japanese almost never have spaces between words, and they are not required in Korean.

I'm afraid this is a little bit misleading. In modern Korean orthography, every word is delimited by a space (Korean Orthographic Rules, article 2: 1988-01-19, Ministry of Education, ROK). The exception to that rule is that particles (Josa) have to follow the preceding word without a space (ibid, article 41). There are also some minor exceptions (ibid, article 43, article 47, article 49), so you might say you're correct in that spaces are *not required* in Korean, but the principle of delimiting every pair of words with a space is still there.

> Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.

I absolutely agree with you on this point.

> An example from Korean: Abeojigabangeisseoyo. Should this be segmented as Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange isseoyo (Father is in the bag)?

I don't think this is such a good example of the enormous difficulties and complexities involved in extracting words (which I agree exist), because the original question was how to extract words out of 'supposedly orthographically correct sentences'. Your example (Abeojigabangeisseoyo) clearly violates the (modern) orthographic rule by gluing together all the words without spaces (nobody would write that way). One of the reasons that spaces are used to separate words in Korean writing is to remove this kind of ambiguity (as taught in the first grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese, where spaces are rarely used to separate words.

Jungshik Shin
Re: extracting words
> in the first grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese where spaces are rarely used to separate words.

From Japanese, how about:

kokodehakimonowonuidekudasai

This could be

koko de hakimono wo nuide kudasai (take your shoes off here)

or

koko deha kimono wo nuide kudasai (take your clothes off here; in this case "ha" is pronounced "wa")

Actually, this example was given at an AAMT meeting last year to show that writing words in kanji + kana rather than just kana makes it easier for MT software to break down strings into words accurately and therefore produce better translations.

Best wishes,

Jonathan Lewis
Tokyo Denki University
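The ambiguity in the romanized Japanese example can be reproduced with a very small dictionary-based segmenter. The sketch below simply enumerates every way of covering the string with entries from a toy lexicon; the lexicon contents and the function name `segmentations` are assumptions made for this illustration, and a real segmenter would need a full dictionary plus some way of ranking the candidates.

```python
def segmentations(text: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon containing only the words from the example above.
LEXICON = {"koko", "de", "deha", "hakimono", "kimono", "wo", "nuide", "kudasai"}

if __name__ == "__main__":
    for words in segmentations("kokodehakimonowonuidekudasai", LEXICON):
        print(" ".join(words))
```

With this lexicon the function finds both readings and nothing else, so the dictionary alone cannot decide between shoes and clothes; that choice needs context.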
Re: extracting words
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the Linebreak TR. For linebreak the only tricky language is Thai, since it requires a dictionary lookup (much like hyphenation in English).

Java (and ICU) supply linebreak mechanisms as a part of the standard API. They also supply wordbreak, but it is recognized that those are purely heuristic for languages such as Chinese and Japanese; the APIs are intended for functions like double-click, not for dividing text into terms for searching. The latter is a very complex problem, since doing it well requires both division into words and extraction of roots: e.g. "go" from "went" and "gone".

It is important to keep these very different processes straight:

- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)

BTW, someone on this thread made this topic out to be even more complex than it is: that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that the modern text does use spaces. Chinese, Japanese and Thai are the main languages written without spaces.

Mark

----- Original Message -----
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 09:47
Subject: FW: extracting words

> Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be that, say, for a simple editor (be it written in Java or whatever) you have to supply a database with language specific details just to do automatic word wrap.

Ciao, Mike
Re: extracting words
Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I recommended in my last message. The Unicode standard is online, as is the TR. Both can be found by going to www.unicode.org, and selecting the right topic. The TR in particular discusses the recommended approach to line break in great detail.

However, as with all Unicode functionality, you should not try to reinvent the wheel. See if you can use the services of the OS/Platform, or get a Unicode library (such as ICU or Basis), to cover your requirements. You mentioned Java; it has an API for line break.

Mark

P.S. It also helps communication if we use the same terms, e.g. "line break", not "word wrapping".

P.P.S. As to the list settings: if we change it to please you, we would annoy someone else. We cannot simultaneously please everyone. And please, nobody start another thread on this topic.

----- Original Message -----
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words

> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)

I recognize that the second and third case are really difficult to handle. But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping and/or don't know whether the actual text part uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?) then how can I go to implement word wrapping? Simply do it character by character?

Ciao, Mike

PS: sorry for sending this mail first to you privately, but those impractical list settings always make me send to the wrong place first. It is difficult for me to get used to these strange settings. I'm answering about 50 mails per day with a simple "reply", so I simply forget all the time that I have to "reply all" (and the out-of-office bounces I get to my private mail whenever I send a message to the Unicode list don't make the task easier).
Re: extracting words
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:

> Hello all,
>
> I'm writing a C program called Blacklist. Its purpose is to accept a string (Unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the word is in the blacklist hashtable.
>
> This is all very straightforward, but the problem is the extracting of words from this string. How do I determine what a word is in Japanese or Korean or whatever other language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and they are not required in Korean. In Devanagari and related scripts a consonant at the end of a word can join with a vowel at the beginning of the next word in a single symbol, so you can't just divide the string into segments. There are other complications in other writing systems.

The problem is not trivial in Latin alphabet writing, either. Hyphenated expressions can be quasi-unified words where one or more components is not a separate word, or ad-hoc, even one-time-only phrases. The definition of words in a language is also changing. "Cannot" is currently one word, but used to be two. "An adder" used to be "a nadder".

> I think somebody must have had this problem and solved it, or maybe my approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange isseoyo (Father is in the bag)?

> I hope somebody can give me some good pointers, directions or suggestions.
>
> Thanks for your time,
> Brahim Mouhdi {42.}

--
Edward Cherlin, Spamfighter http://www.cauce.org
"It isn't what you don't know that hurts you, it's what you know that ain't so."--Mark Twain, or else some other prominent 19th century humorist and wit
RE: extracting words
Like Edward saud, Getting words from a string is nontrivial. You get similar issues in Thai. Thai coes not have any space between words, but the script is Indic based (phonetic). You have to continuously look up the speller and even then it can't be correct for all cases. E.g. Sunday or therapist could be interpreted as two words sun day while the user meant Sunday etc. In sanskrit, you can create new words by doing a "sandhi" or conjunction. Makarand -Original Message- From: Edward Cherlin [mailto:[EMAIL PROTECTED]] Sent: Sunday, 11 February, 2001 05:34 To: Unicode List Subject: Re: extracting words At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote: Hello all, I'm writing a C-program that is called Blacklist, It's purpose is to accept a string (unicode) and extract words from it, then hash the found words according to a hashing algorythm and see if the word is in blacklist hashtable. This is all very straightforward, but the problem is the extracting of wordsfrom this string. How do i determine what a word is in Japanese or Korean or whatever other language? { a space ? } No. Chinese and Japanese almost never have spaces between words, and they are not required in Korean. In Devanagari and related scripts a consonant at the end of a word can join with a vowel at the beginning of the next word in a single symbol, so you can't just divide the string into segments. There are other complications in other writing systems. The problem is not trivial in Latin alphabet writing, either. Hyphenated expressions can be quasi-unified words where one or more components is not a separate word, or ad-hoc, even one-time-only phrases. The definition of words in a language is also changing. "Cannot" is currently one word, but used to be two. "An adder" used to be "a nadder". I think somebody must have had this problem and solved it, or maybe my approach to the problem is wrong. Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. 
Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough. An example from Korean: Abeojigabangeisseoyo. Should this be segmented as Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange isseoyo (Father is in the bag)?

I hope somebody can give me some good pointers, directions or suggestions. Thanks for your time, Brahim Mouhdi {42.}

--
Edward Cherlin, Spamfighter http://www.cauce.org
"It isn't what you don't know that hurts you, it's what you know that ain't so."--Mark Twain, or else some other prominent 19th century humorist and wit
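
To make the Korean ambiguity above concrete, here is a minimal sketch of dictionary-based segmentation. It is written in Python rather than C for brevity, and everything in it is an assumption for illustration: the function name `segmentations`, the five-entry toy lexicon, and the use of the romanized string instead of Hangul. It simply enumerates every way of covering the string with lexicon entries; it makes no attempt to pick the intended reading.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words drawn from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, in the romanization used in the message above.
LEXICON = {"abeoji", "abeojiga", "bange", "gabange", "isseoyo"}

readings = segmentations("abeojigabangeisseoyo", LEXICON)
for words in readings:
    print(" ".join(words))
```

Both the "room" and the "bag" segmentations come back as valid, which is the point of the example: a lexicon alone cannot choose between them, so a blacklist lookup would have to consider every candidate word or rely on context.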
RE: extracting words
You might have to apply different rules depending on the script. In Indic scripts there are often no explicit word boundary markers and you may have to look for grammatical particles. In Tibetan, a string of letters and vowels between two tsheg [U+0F0B / U+0F0C] characters (or other "punctuation") is a morpheme (not that different from a word) - but there are many complex words consisting of two or more such morphemes. (I don't know any CJK languages but I suspect that most individual characters in that block are morphemes as well.)

BTW without determining the language as well as the script, how do you propose to determine if a particular string actually matches a word in your "blacklist" (in terms of meaning) or not? The same string of characters might mean completely different things in two languages that share the same script (/Unicode block).

- Chris

-Original Message-
From: Brahim Mouhdi [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 29, 2001 1:03 AM
To: Unicode List
Subject: extracting words

Hello all, I'm writing a C program called Blacklist. Its purpose is to accept a string (Unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the word is in the blacklist hash table. This is all very straightforward, but the problem is the extracting of words from this string. How do I determine what a word is in Japanese or Korean or whatever other language? { a space ? }

I think somebody must have had this problem and solved it, or maybe my approach to the problem is wrong. I hope somebody can give me some good pointers, directions or suggestions. Thanks for your time, Brahim Mouhdi {42.}
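
The Tibetan rule described above can be sketched as a split on the tsheg characters. This is only an illustration, in Python: the helper name `tsheg_units` is made up, and including U+0F0D (shad) as one instance of "other punctuation" is an assumption, not a complete list. The units it returns are the tsheg-delimited strings; grouping them into multi-morpheme words would still need a lexicon.

```python
import re

# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
# U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
# U+0F0D TIBETAN MARK SHAD (assumed here as one example of other punctuation)
DELIMITERS = re.compile("[\u0f0b\u0f0c\u0f0d]+")


def tsheg_units(text):
    """Split Tibetan text into the units found between delimiters."""
    return [unit for unit in DELIMITERS.split(text) if unit]


# "bod skad" ("Tibetan language"), written with escapes to avoid font issues.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"
units = tsheg_units(sample)
print(len(units))
```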
Re: extracting words
Christopher Fynn wrote:

BTW without determining the language as well as the script, how do you propose to determine if a particular string actually matches a word in your "blacklist" (in terms of meaning) or not? The same string of characters might mean completely different things in two languages that share the same script (/Unicode block).

This is assuming that what we want is not just a matching of *orthographical* words (character strings), but of *lexicographical* words (aka lexemes). Which of course brings with it even more problems. If you want to filter out all occurrences of, say, a particular verb, you'll have to look out for all possible grammatical forms of that verb. That is 5 forms at most for a typical English verb (go, goes, went, gone, going), but maybe several hundred in a heavily inflectional or agglutinative language. In some languages the set of possible forms of a lexeme may even be open-ended. There is no way of doing that without a full-blown morphological parser (which of course would have to be language-specific).

Looks like this goes a bit beyond what Brahim is planning to do.

Lukas
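
For a language with as few forms as English, the difference between matching strings and matching lexemes can be sketched with a plain lookup table. The names `FORMS`, `BLOCKED_LEXEMES` and `is_blocked_lexeme` are hypothetical, and the table is typed in by hand; this is not a morphological parser and would not scale to a language where the set of forms is very large or open-ended.

```python
# Hand-written table: lexeme -> its inflected forms (English "go" only).
FORMS = {"go": {"go", "goes", "went", "gone", "going"}}

# Reverse index: surface form -> lexeme.
FORM_TO_LEXEME = {
    form: lexeme for lexeme, forms in FORMS.items() for form in forms
}

BLOCKED_LEXEMES = {"go"}


def is_blocked_lexeme(token):
    """True if `token` is a known form of a blocked lexeme."""
    return FORM_TO_LEXEME.get(token.lower()) in BLOCKED_LEXEMES


print(is_blocked_lexeme("Went"), is_blocked_lexeme("walk"))
```

A string-level blacklist containing only "go" would miss "went" entirely; the table catches it, but only because someone listed every form in advance.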
Re: extracting words
Lukas Pietsch wrote:

This is assuming that what we want is not just a matching of *orthographical* words (character strings), but of *lexicographical* words (aka lexemes).

But it is impossible in fully cross-linguistic situations in general. There is simply nothing to do about the fact that "such" is a very common word, perfectly harmless, in the English language; whereas in the Nootka language (an Amerindian language of the Pacific Northwest) it is a vulgarism for the external female genitalia. A properly multilingual vulgarism-remover would have to determine whether the document was English or Nootka before deciding whether to block "such".

--
There is / one art || John Cowan [EMAIL PROTECTED]
no more / no less || http://www.reutershealth.com
to do / all things || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein
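
One way to picture the point above is a blacklist keyed by language as well as by string. This is a sketch under stated assumptions: the language labels "english" and "nootka" are plain illustrative keys, the function `blocked_in` is hypothetical, and the hard part, deciding which language a document is in, is taken as given.

```python
# Per-language blacklists; the same string may be listed in one and not another.
BLACKLIST = {
    "english": set(),
    "nootka": {"such"},
}


def blocked_in(word, language):
    """True if `word` is blacklisted for `language` (unknown languages block nothing)."""
    return word.lower() in BLACKLIST.get(language, set())


print(blocked_in("such", "english"), blocked_in("such", "nootka"))
```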