http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27182
[PATCH] Thai Analysis Enhancement

------- Additional Comments From [EMAIL PROTECTED] 2004-02-24 13:13 -------

A note from the original contribution email, for archiving purposes:

Thai is one of those languages that has no whitespace between words. Because of this, Lucene's StandardTokenizer can't tokenize a Thai sentence and instead returns the whole sentence as a single token. JDK 1.4 comes with a simple dictionary-based tokenizer for Thai. With the wrappers, I can use the Thai BreakIterator to tokenize the Thai sentences returned by the StandardTokenizer.

My design is quite simple. I added a <THAI> token type to StandardTokenizer.jj (renamed to TestStandardTokenizer.jj in my test). The StandardTokenizer then returns each Thai sentence tagged <THAI>, among the other ordinary tokens. BreakIteratorTokenTokenizer then detects those tokens and breaks them down further into smaller tokens, which represent actual Thai words.
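
For readers unfamiliar with the JDK facility the patch relies on, here is a minimal standalone sketch of dictionary-based Thai word breaking with java.text.BreakIterator. It is not code from the patch; the class name ThaiBreakDemo and the sample sentence are illustrative only.

    import java.text.BreakIterator;
    import java.util.Locale;

    // Sketch: segment a whitespace-free Thai sentence into words using
    // the JDK's BreakIterator, which is dictionary-based for Thai.
    public class ThaiBreakDemo {
        public static void main(String[] args) {
            // Sample Thai sentence with no spaces between words (illustrative).
            String sentence = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35\u0E04\u0E23\u0E31\u0E1A";

            BreakIterator breaker = BreakIterator.getWordInstance(new Locale("th"));
            breaker.setText(sentence);

            // Walk the word boundaries and print each segment; in the patch,
            // segments like these would become individual Lucene tokens.
            int start = breaker.first();
            for (int end = breaker.next(); end != BreakIterator.DONE; end = breaker.next()) {
                System.out.println(sentence.substring(start, end));
                start = end;
            }
        }
    }

The same boundary-walking loop is what a wrapping tokenizer would apply to each token that the grammar tags as <THAI>, emitting one token per segment instead of one per sentence.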