I don't know any Asian languages, but from earlier experiments I
remember that bigram tokenization could sometimes hurt matching, e.g.:

w1w2w3, tokenized as w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_), would
miss a search for w2. Tokenizing as w1 w2 w3 would work better.
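To make the failure mode concrete, here is a toy sketch in plain Python (not Lucene code; the function names are illustrative only). Overlapping bigrams of a three-character string never produce a token equal to the middle character, so an exact-token search for w2 finds nothing, while unigram tokens match:

```python
def bigram_tokens(text):
    """Overlapping character bigrams, as a bigram CJK tokenizer would emit."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def unigram_tokens(text):
    """One token per character (the 'sigram' approach in the thread)."""
    return list(text)

doc = "ABC"  # stands in for the three CJK characters w1 w2 w3

print(bigram_tokens(doc))   # ['AB', 'BC'] -- no token equals 'B'
print(unigram_tokens(doc))  # ['A', 'B', 'C'] -- 'B' is a token

query = "B"  # a single-character query, i.e. a search for w2
assert query not in bigram_tokens(doc)   # bigram index misses it
assert query in unigram_tokens(doc)      # unigram index matches it
```

A real bigram index could still serve such queries by falling back to a wildcard/prefix match or by indexing unigrams alongside bigrams, but with pure exact-token matching the miss above is exactly the problem described.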

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Che Dong wrote:
> > 2. CJK support:
> >    2.1 sigram based (no word segmentation; each character is one
> >        token): modified from StandardTokenizer.java
> >        http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=330905
> >        CJKTokenizer for Asian languages (Chinese, Japanese, Korean)
> >        word segmentation
> >        http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgId=450266
> >        StandardTokenizer with sigram-based CJK support
> >
> >    2.2 bigram-based word segmentation: modified from SimpleTokenizer
> >        to CJKTokenizer.java
> >        http://www.mail-archive.com/[email protected]/msg01220.html
> 
> I think it would be great to have some support for Asian languages
> built into Lucene.  Which of these approaches do you think is best?
> I like the idea of a StandardTokenizer or SimpleTokenizer that
> automatically provides this via bigrams.  What do others think?
>
> Doug
>
> 
> --
> To unsubscribe, e-mail:  
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
> 


=====
__________________________________
[EMAIL PROTECTED] -- http://www.lissus.com

