I am using Lucene 2.9.1. While trying to understand ShingleFilter, I wrote 
the code below.

String test = "please divide this sentence";
Tokenizer wsTokenizer = new WhitespaceTokenizer(new StringReader(test));
ShingleFilter filter = new ShingleFilter(wsTokenizer, 3);
filter.setOutputUnigrams(false);

TermAttribute termAtt =
    (TermAttribute) filter.getAttribute(TermAttribute.class);

while (filter.incrementToken()) {
    System.out.println(termAtt.term());
}

I noticed that with outputUnigrams set to false, I get the same output for 
maxShingleSize=2 and maxShingleSize=3:

please divide 
divide this 
this sentence 

When I set maxShingleSize to 4, the output is:

please divide 
please divide this sentence 
divide this 
this sentence 

With maxShingleSize=3 and outputUnigrams=false, I was expecting the 
following output:

please divide this 
divide this sentence 
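
To make the expectation concrete, here is a minimal plain-Java sketch of the trigram output I had in mind. It does not use any Lucene classes and is not how ShingleFilter is implemented; the class name ShingleSketch and the method fixedSizeShingles are hypothetical names made up for this illustration, and it assumes simple whitespace tokenization.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper, not part of Lucene: joins every run of exactly
    // `size` consecutive whitespace-separated tokens with a single space.
    public static List<String> fixedSizeShingles(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        String[] tokens = text.trim().split("\\s+");
        List<String> shingles = new ArrayList<String>();
        // Slide a window of `size` tokens across the input.
        for (int start = 0; start + size <= tokens.length; start++) {
            StringBuilder sb = new StringBuilder(tokens[start]);
            for (int i = start + 1; i < start + size; i++) {
                sb.append(' ').append(tokens[i]);
            }
            shingles.add(sb.toString());
        }
        return shingles;
    }

    public static void main(String[] args) {
        for (String s : fixedSizeShingles("please divide this sentence", 3)) {
            System.out.println(s);
        }
    }
}
```

For the test sentence, this sketch prints the two lines shown above. Note that it emits only shingles of exactly the given size, which may not be what maxShingleSize is meant to express.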

Am I missing something, or is this the expected behavior?

I checked the source code of ShingleFilterTest (Lucene 3.0.0) and saw that 
TRI_GRAM_TOKENS is tested only with outputUnigrams=true, not with 
outputUnigrams=false.

