Hello, I am very confused about what ShingleFilter seems to be doing in Lucene 4.6. What I would like to do is extract all possible bigrams from a sentence. So if the sentence is "This is a dog", I want "This is", "is a", "a dog".

Here is my code:

    StringTokenizer itr = new StringTokenizer(theText,"\n");
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);

    while (itr.hasMoreTokens()) {
        String theSentence = itr.nextToken();
        StringReader reader = new StringReader(theSentence);
        TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
        ShingleFilter theFilter = new ShingleFilter(tokenStream);
        theFilter.setOutputUnigrams(false);
        CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
        theFilter.reset();
        while (theFilter.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        theFilter.end();
        theFilter.close();
    }

What I see in the output is this: suppose the sentence is "resting comfortably and in no distress". I get the following output:

    resting resting comfortably resting comfortably comfortably comfortably comfortably _ comfortably _ _ distress _ distress distress

So it looks like not only do I not get bigrams, I get spurious 3-grams by repeating words.

Could someone please help?

Thanks much,
Natalia Connolly