Hi ayyanar,

On 01/05/2009 at 12:23 PM, ayyanar wrote:
> I need a tokenizer that tokenizes a keyword as follows: Consider an
> example "President day" - this should be tokenized as "President day",
> "President", "Day" This seems to be a functionality of a keyword
> tokenizer and whitespace tokenizer Do we have any tokenizer that does
> this job or we need to write a custom one?

A ShingleFilter <http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/shingle/ShingleFilter.html> over a whitespace tokenizer should do the trick. By default, unigrams (individual terms) are output in addition to shingles (token n-grams).

Steve
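
To make the expected token stream concrete, here is a minimal sketch in plain Java that mimics the behaviour described above: split on whitespace, then emit each unigram followed by the shingles that start at the same position. It does not use Lucene; the class name ShingleSketch and the method shingles are hypothetical, chosen only for illustration, and the emission order is an assumption about how such a filter would typically interleave unigrams and shingles.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "ShingleFilter over a whitespace tokenizer".
// This is an illustration of the output shape, not Lucene's implementation.
class ShingleSketch {

    // Returns unigrams plus token n-grams (shingles) of up to maxShingleSize tokens.
    static List<String> shingles(String text, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to tokenize
        }
        String[] tokens = trimmed.split("\\s+"); // whitespace tokenization

        for (int i = 0; i < tokens.length; i++) {
            StringBuilder sb = new StringBuilder(tokens[i]);
            out.add(sb.toString()); // the unigram at position i

            // Grow the shingle one token at a time: sizes 2..maxShingleSize.
            for (int n = 2; n <= maxShingleSize && i + n <= tokens.length; n++) {
                sb.append(' ').append(tokens[i + n - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [President, President day, day]
        System.out.println(shingles("President day", 2));
    }
}
```

Note that the tokens keep the casing of the input ("day", not "Day"); any case normalization would have to be a separate step in the analysis chain.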
