Bigrams for CJK with ICUTokenizer?

Burton-West, Tom Fri, 04 Feb 2011 09:47:32 -0800

Hello all,

We are using the ICUTokenizer because we have documents in about 400 different 
languages. We are also setting autoGeneratePhraseQueries to false so that, for 
CJK and other languages that don't use spaces to separate words, the tokens 
produced by the ICUTokenizer are not automatically searched as a phrase.


The ICUTokenizer emits unigrams for Chinese (Han). We would prefer to use 
overlapping bigrams, as in the CJKAnalyzer. Is it possible to configure the 
ICUTokenizer to emit overlapping bigrams?

Alternatively, is there a filter we could add to the filter chain after the 
ICUTokenizer that would produce overlapping bigrams for CJK?
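
To illustrate what is meant by overlapping bigrams, here is a minimal sketch in 
plain Python. It is not Lucene or ICU code: the function name 
overlapping_bigrams is hypothetical, the input is assumed to be a run of 
single-character Han tokens such as a unigram tokenizer would produce, and 
keeping a lone character as-is is an assumption of this sketch.

```python
def overlapping_bigrams(unigrams):
    """Join each adjacent pair of single-character tokens into one bigram.

    Hypothetical helper for illustration only; not part of Lucene or ICU.
    A run shorter than two tokens is returned unchanged (an assumption here).
    """
    if len(unigrams) < 2:
        return list(unigrams)
    # Pair token i with token i+1, so consecutive bigrams share one character.
    return [a + b for a, b in zip(unigrams, unigrams[1:])]


# Unigram output, one token per Han character.
unigrams = list("中华人民")
print(unigrams)                       # ['中', '华', '人', '民']
print(overlapping_bigrams(unigrams))  # ['中华', '华人', '人民']
```

A real filter would presumably also need to check the script or token type of 
each token, so that only CJK runs are paired and tokens from other languages 
pass through untouched.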

Tom Burton-West
