I don't know any Asian languages, but from earlier experiments I remember that bigram tokenization could sometimes hurt matching. For example, w1w2w3 tokenized as w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would miss a search for w2 alone, since only the bigrams are indexed; tokenizing as w1 w2 w3 would handle that query better.
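To make the failure mode concrete, here is a minimal, self-contained Java sketch. It is only an illustration under my own assumptions: the bigrams/sigrams helpers and the sample text are invented for the demo and are not CJKTokenizer's or StandardTokenizer's actual code.

import java.util.ArrayList;
import java.util.List;

public class BigramMatchDemo {

    // Overlapping character bigrams: "ABC" -> [AB, BC].
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // One token per character (the "sigram" approach): "ABC" -> [A, B, C].
    static List<String> sigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "中文字";  // three CJK characters, standing in for w1w2w3
        String query = "文";    // a single-character query, i.e. w2

        System.out.println(bigrams(doc));                 // [中文, 文字]
        System.out.println(bigrams(doc).contains(query)); // false: the bigram index misses w2
        System.out.println(sigrams(doc).contains(query)); // true: the sigram index matches
    }
}

The same effect shows up in a real index: if only bigrams are stored, a single-character query term never equals any indexed token, so it matches nothing unless the query side is expanded or unigrams are indexed as well.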
--- Doug Cutting <[EMAIL PROTECTED]> wrote:

> Che Dong wrote:
> > 2. CJK support:
> > 2.1 sigram-based (no word segmentation; one character per token): modified from StandardTokenizer.java
> > http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=330905
> > CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
> > http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=450266
> > StandardTokenizer with sigram-based CJK support
> >
> > 2.2 bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
> > http://www.mail-archive.com/[email protected]/msg01220.html
>
> I think it would be great to have some support for Asian languages built into Lucene. Which of these approaches do you think is best? I like the idea of a StandardTokenizer or SimpleTokenizer that automatically provides this via bigrams. What do others think?
>
> Doug
