Hi,

I have written some simple adaptor/wrapper classes for java.text.BreakIterator, available in JDK 1.4 and later. I also created a ThaiAnalyzer class based on those wrappers.
Thai is one of those languages that has no whitespace between words. Because of this, Lucene's StandardTokenizer can't break up a Thai sentence and returns the whole sentence as a single token. JDK 1.4 ships with a simple dictionary-based word segmenter for Thai (via java.text.BreakIterator). With the wrappers, I can use the Thai BreakIterator to tokenize the Thai sentences returned from StandardTokenizer.

My design is quite simple. I added a <THAI> token type to StandardTokenizer.jj (renamed TestStandardTokenizer.jj in my test). StandardTokenizer then returns each Thai sentence as a token of type <THAI>, alongside the ordinary tokens. BreakIteratorTokenTokenizer detects tokens of that type and breaks them down further into smaller tokens, which represent the actual Thai words.

The source code is available here: http://pichai.netfirms.com/thai_analyzer.zip

I'm not sure if this code is worth being part of Lucene. If it is, I can modify the code as you suggest and contribute it to the Lucene project.

Thanks,
Pichai
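P.S. In case a concrete illustration helps, below is a minimal, standalone sketch of the word-splitting step using only java.text.BreakIterator with the Thai locale ("th"). This is not the code from the zip above; the class and method names (ThaiWordSplitter, splitWords) are just placeholders for illustration, and it uses raw collections so it compiles on JDK 1.4.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiWordSplitter {

    // Splits the input text into words using the JDK's dictionary-based
    // word BreakIterator for the Thai locale.
    public static List splitWords(String text) {
        BreakIterator wordIterator = BreakIterator.getWordInstance(new Locale("th"));
        wordIterator.setText(text);

        List words = new ArrayList();
        int start = wordIterator.first();
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            // Keep only non-empty segments; a real token filter would also
            // skip punctuation-only segments here.
            String word = text.substring(start, end).trim();
            if (word.length() > 0) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22" is Thai for "Thai language";
        // the iterator should break it into two words.
        List words = splitWords("\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22");
        System.out.println(words);
    }
}

In the actual analyzer, the same loop would sit inside BreakIteratorTokenTokenizer and be applied only to tokens tagged <THAI>, so ordinary tokens pass through unchanged.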