http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27182

[PATCH] Thai Analysis Enhancement

------- Additional Comments From [EMAIL PROTECTED]  2004-02-24 13:13 -------
A note from the original contribution email, for archiving purposes:

Thai is one of those languages that has no whitespace
between words. Because of this, Lucene's
StandardTokenizer can't tokenize a Thai sentence and
returns the whole sentence as a single token.

JDK 1.4 comes with a simple dictionary-based word
breaker for Thai. Through the java.text.BreakIterator
wrappers, I can use it to tokenize the Thai sentences
returned from StandardTokenizer.
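For the archive, here is a minimal sketch of that JDK
facility. The class name ThaiBreakDemo and the sample
string are illustrative only; it assumes a JDK 1.4+
runtime with the Thai dictionary data present:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class ThaiBreakDemo {
        public static void main(String[] args) {
            // "ภาษาไทย" ("Thai language"), written without spaces.
            String sentence = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";

            // The word-break iterator for the Thai locale uses a
            // built-in dictionary to locate word boundaries.
            BreakIterator words =
                BreakIterator.getWordInstance(new Locale("th"));
            words.setText(sentence);

            int start = words.first();
            for (int end = words.next();
                 end != BreakIterator.DONE;
                 start = end, end = words.next()) {
                System.out.println(sentence.substring(start, end));
            }
        }
    }

With the Thai dictionary data available, this should print
the individual words (here ภาษา and ไทย) on separate lines.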

My design is quite simple. I added a <THAI> token type
to StandardTokenizer.jj (renamed to
TestStandardTokenizer.jj in my test). The
StandardTokenizer then returns each whole Thai
sentence tagged <THAI>, among the other ordinary
tokens. BreakIteratorTokenTokenizer then detects these
tokens and breaks them down further into smaller
tokens, which represent the actual Thai words.
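
To make the design concrete, here is a rough sketch of
such a filter against the Lucene 1.4-era TokenStream
API. The class name ThaiWordFilter and all internal
details are hypothetical; this is not the actual patch:

    import java.io.IOException;
    import java.text.BreakIterator;
    import java.util.Locale;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class ThaiWordFilter extends TokenFilter {
        private final BreakIterator breaker =
            BreakIterator.getWordInstance(new Locale("th"));
        private Token thaiToken; // the <THAI> token being subdivided
        private int subStart;    // current boundary inside its text

        public ThaiWordFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            // If we are inside a <THAI> token, emit its next word.
            if (thaiToken != null) {
                int end = breaker.next();
                if (end != BreakIterator.DONE) {
                    Token word = new Token(
                        thaiToken.termText().substring(subStart, end),
                        thaiToken.startOffset() + subStart,
                        thaiToken.startOffset() + end,
                        thaiToken.type());
                    subStart = end;
                    return word;
                }
                thaiToken = null; // finished this sentence
            }

            Token token = input.next();
            if (token == null || !"<THAI>".equals(token.type())) {
                return token; // pass ordinary tokens through unchanged
            }

            // Start subdividing this Thai sentence token.
            breaker.setText(token.termText());
            subStart = breaker.first();
            thaiToken = token;
            return next();
        }
    }

The filter passes non-Thai tokens through unchanged and
only subdivides tokens whose type is <THAI>, so it
composes with the modified StandardTokenizer exactly as
described above.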
