Hi,
We have been using Lucene to index Chinese text for our KM system, for
which we added a Chinese text analyzer. However, the current
implementation is a hack that glues everything together.
We are now rearchitecting it and would like to get some suggestions from
the group.
First, let me explain the difficulties involved in analyzing Chinese
text. Unlike English, Chinese and most other Asian languages (such as
Japanese, Korean, and Thai) do not have clear word boundaries. An
additional component, called a segmentor, needs to be implemented to
separate a string into a list of words.
The question is where this segmentor should live. Currently, we are
designing it to sit below the analyzer, so that each token passed up to
the Analyzer is either a word token or a symbol token, to which
filtering can then be applied. I think this is a good place to put the
segmentor.
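To make that placement concrete, here is a minimal sketch of the segmentation step itself. For the sake of a self-contained example it uses the JDK's java.text.BreakIterator (whose API ICU4J's BreakIterator mirrors) rather than our Chinese segmentor; the Segmentor class name and segment method are hypothetical, not part of Lucene or ICU4J.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Segmentor {
    // Split a string into word tokens, dropping whitespace and
    // punctuation, so that only word/symbol tokens reach the Analyzer.
    public static List<String> segment(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> result = new ArrayList<>();
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            String token = text.substring(start, end);
            // Keep only tokens that start with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                result.add(token);
            }
        }
        return result;
    }
}
```

A real implementation would replace the JDK break iterator with the dictionary-based Chinese segmentor, but the shape of the pipeline stays the same.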
In addition to our Chinese segmentor, IBM's ICU4J provides a
BreakIterator, with both rule-based and dictionary-based break
iterators, and it currently supports Thai.
Functionally, the BreakIterator is very closely related to our
segmentor, so we are looking into the possibility of integrating our
segmentor into the BreakIterator framework.
If Lucene used ICU4J at the bottom, it could become a search engine
capable of handling as many languages as ICU4J supports.
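To sketch how a BreakIterator-backed segmentor could feed Lucene, here is a small tokenizer that emits tokens with start/end offsets, mirroring the next() contract of a Lucene Tokenizer. This is an illustration only: the SegmentingTokenizer and its nested Token class are hypothetical stand-ins, not Lucene's actual classes, and it again uses the JDK's java.text.BreakIterator so it runs without ICU4J.

```java
import java.text.BreakIterator;

public class SegmentingTokenizer {
    // Minimal stand-in for Lucene's Token: text plus character offsets.
    public static final class Token {
        public final String text;
        public final int start, end;
        Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
    }

    private final String text;
    private final BreakIterator boundary;
    private int start;

    public SegmentingTokenizer(String text, BreakIterator boundary) {
        this.text = text;
        this.boundary = boundary;
        boundary.setText(text);
        this.start = boundary.first();
    }

    // Return the next word token, or null at end of input,
    // mirroring the next() contract of a Lucene Tokenizer.
    public Token next() {
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            // Skip whitespace/punctuation segments between words.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                Token t = new Token(text.substring(start, end), start, end);
                start = end;
                return t;
            }
        }
        return null;
    }
}
```

Because the segmentation logic is isolated behind the BreakIterator interface, swapping in a dictionary-based Chinese (or Thai) iterator would not change the tokenizer at all, which is exactly the appeal of building on the ICU4J framework.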
I'd like to know what the group thinks of this idea.
Thanks
ICU4J:
http://oss.software.ibm.com/developerworks/opensource/icu4j/index.html
BreakIterator:
http://oss.software.ibm.com/icu4j/doc/com/ibm/text/BreakIterator.html
David Li
DigitalSesame
_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-dev