I am using Lucene 2.9.1. While trying to understand ShingleFilter, I wrote the code below.

    String test = "please divide this sentence";
    Tokenizer wsTokenizer = new WhitespaceTokenizer(new StringReader(test));
    ShingleFilter filter = new ShingleFilter(wsTokenizer, 3); // maxShingleSize = 3
    filter.setOutputUnigrams(false);
    TermAttribute termAtt = (TermAttribute) filter.getAttribute(TermAttribute.class);
    while (filter.incrementToken()) {
        System.out.println(termAtt.term());
    }

I noticed that if I set outputUnigrams to false, it gives me the same output for maxShingleSize=2 and maxShingleSize=3:

    please divide
    divide this
    this sentence

When I set maxShingleSize to 4, the output is:

    please divide
    please divide this sentence
    divide this
    this sentence

I was expecting the following output with maxShingleSize=3 and outputUnigrams=false:

    please divide this
    divide this sentence

Am I missing something, or is this the expected behavior? I checked the source code of ShingleFilterTest (Lucene 3.0.0) and see that TRI_GRAM_TOKENS are tested only with outputUnigrams=true, not with outputUnigrams=false.
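
To make the expectation concrete, here is a minimal plain-Java sketch with no Lucene dependency that enumerates word shingles for a range of sizes. The class name ShingleSketch and its shingles(text, minSize, maxSize) method are hypothetical helpers made up for illustration. They are not part of the Lucene API and do not reproduce how ShingleFilter works internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive whitespace-separated
    // words, grouped by starting position (shorter shingles first).
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<String>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // no words, no shingles
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one word at a time from this start position.
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                if (size >= minSize) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "please divide this sentence";
        // Only trigrams: the output I expected from the filter.
        System.out.println(shingles(test, 3, 3));
        // Bigrams and trigrams together.
        System.out.println(shingles(test, 2, 3));
    }
}
```

With minSize=3 and maxSize=3 this yields the two trigrams I expected. With minSize=2 and maxSize=3 it yields five shingles (please divide, please divide this, divide this, divide this sentence, this sentence). Which of these two ShingleFilter is meant to produce when outputUnigrams is false is exactly what I am unsure about.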