Hi,

I have written some simple adaptor/wrapper classes for
java.text.BreakIterator, available in jdk 1.4 and
later. I also created a ThaiAnalyzer class based on
those wrappers.

Thai is one of those languages that has no whitespace
between words. Because of this, Lucene's
StandardTokenizer can't break up a Thai sentence and
returns the whole sentence as a single token.

JDK 1.4 comes with a simple dictionary-based tokenizer
for Thai (via java.text.BreakIterator). With the
wrappers, I can use the Thai BreakIterator to tokenize
the Thai sentences returned by StandardTokenizer.
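For example, plain JDK code along these lines uses the Thai
word-instance BreakIterator to split a run of Thai text into
words (a minimal sketch; the sample string is just illustrative):

import java.text.BreakIterator;
import java.util.Locale;

public class ThaiBreakDemo {
    public static void main(String[] args) {
        // Word-instance BreakIterator for the Thai locale; in JDK 1.4+
        // this is backed by a dictionary-based Thai word segmenter.
        BreakIterator breaker = BreakIterator.getWordInstance(new Locale("th"));
        String text = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"; // "Thai language"
        breaker.setText(text);

        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            String word = text.substring(start, end).trim();
            if (word.length() > 0) {          // skip whitespace/punctuation runs
                System.out.println(word);
            }
        }
    }
}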

My design is quite simple. I added a <THAI> token type
to StandardTokenizer.jj (renamed to
TestStandardTokenizer.jj in my test). The
StandardTokenizer then returns a whole Thai sentence
tagged <THAI>, among the other ordinary tokens.
BreakIteratorTokenTokenizer then detects such tokens
and breaks them down further into smaller tokens, each
representing an actual Thai word.
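The actual classes in the zip may look a bit different, but
roughly, against the classic TokenStream/TokenFilter API, the
second step could be sketched like this (the class name and the
"<THAI>" type string here are only illustrative):

import java.io.IOException;
import java.text.BreakIterator;
import java.util.LinkedList;
import java.util.Locale;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ThaiWordFilter extends TokenFilter {
    private final BreakIterator breaker = BreakIterator.getWordInstance(new Locale("th"));
    private final LinkedList pending = new LinkedList(); // buffered sub-tokens

    public ThaiWordFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        // Emit any sub-tokens left over from a previously split <THAI> token.
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token token = input.next();
        if (token == null || !"<THAI>".equals(token.type())) {
            return token; // pass non-Thai tokens through unchanged
        }
        // Split the Thai "sentence" token into individual words.
        String text = token.termText();
        breaker.setText(text);
        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            String word = text.substring(start, end).trim();
            if (word.length() > 0) {
                pending.add(new Token(word,
                        token.startOffset() + start,
                        token.startOffset() + end,
                        token.type()));
            }
        }
        return pending.isEmpty() ? next() : (Token) pending.removeFirst();
    }
}

A ThaiAnalyzer along these lines would then just chain the
modified StandardTokenizer into such a filter, plus the usual
StandardFilter/LowerCaseFilter/StopFilter if desired.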

The source code is available here:
http://pichai.netfirms.com/thai_analyzer.zip

I'm not sure whether this code is worth being part of
Lucene. If it is, I can modify the code as you suggest
and contribute it to the Lucene project.

Thanks,
Pichai
