Either remove the ShingleAnalyzerWrapper or the additional ShingleFilter; you only need one of them.

On Wed, Apr 2, 2014 at 2:44 PM, Natalia Connolly <natalia.v.conno...@gmail.com> wrote:
> Hi Robert,
>
> No, I did not… I just needed the filter to stop it from outputting
> unigrams; otherwise I was getting "This", "this is", "is", "is a ", and so
> on. Is there another way I could do it?
>
> Thank you,
>
> Natalia
>
> On Wed, Apr 2, 2014 at 2:40 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> Did you really mean to shingle twice (ShingleAnalyzerWrapper just
>> wraps the analyzer with a ShingleFilter, then the code wraps that with
>> another ShingleFilter again)?
>>
>> On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly
>> <natalia.v.conno...@gmail.com> wrote:
>> > Hello,
>> >
>> > I am very confused about what ShingleFilter seems to be doing in
>> > Lucene 4.6. What I would like to do is extract all possible bigrams
>> > from a sentence. So if the sentence is "This is a dog", I want
>> > "This is", "is a ", "a dog".
>> >
>> > Here is my code:
>> >
>> > StringTokenizer itr = new StringTokenizer(theText,"\n");
>> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>> > ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
>> >
>> > while (itr.hasMoreTokens()) {
>> >
>> >     String theSentence = itr.nextToken();
>> >     StringReader reader = new StringReader(theSentence);
>> >     TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
>> >     ShingleFilter theFilter = new ShingleFilter(tokenStream);
>> >     theFilter.setOutputUnigrams(false);
>> >
>> >     CharTermAttribute charTermAttribute =
>> >         theFilter.addAttribute(CharTermAttribute.class);
>> >
>> >     theFilter.reset();
>> >
>> >     while (theFilter.incrementToken()) {
>> >         System.out.println(charTermAttribute.toString());
>> >     }
>> >
>> >     theFilter.end();
>> >     theFilter.close();
>> > }
>> >
>> > What I see in the output is this: suppose the sentence is "resting
>> > comfortably and in no distress". I get the following output:
>> >
>> > resting resting comfortably
>> > resting comfortably comfortably
>> > comfortably comfortably _
>> > comfortably _ _ distress
>> > _ distress distress
>> >
>> > So it looks like not only do I not get bigrams, I get spurious 3-grams
>> > by repeating words. Could someone please help?
>> >
>> > Thanks much,
>> >
>> > Natalia Connolly
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
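
To see why shingling twice gives the output quoted above, here is a minimal plain-Java sketch. It is not Lucene code: the class `ShingleSketch`, the method `shingle`, and the hand-written token list are all hypothetical, and the `_` entry is assumed to stand for the gap left where StandardAnalyzer dropped the stop words "and", "in" and "no". The model emits an optional unigram plus one bigram per position, which is a simplification of what a real ShingleFilter does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Assumed placeholder for a position where a stop word was removed.
    static final String FILLER = "_";

    // Simplified model of one bigram-shingling pass (illustration only, not Lucene).
    // For each token: optionally emit the token itself, then the bigram with its successor.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            // A filler is never emitted on its own, only inside a bigram.
            if (outputUnigrams && !tok.equals(FILLER)) {
                out.add(tok);
            }
            if (i + 1 < tokens.size()) {
                out.add(tok + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written tokens for "resting comfortably and in no distress"
        // after lowercasing and stop-word removal (an assumption of this sketch).
        List<String> tokens = Arrays.asList("resting", "comfortably", "_", "distress");

        // One pass, unigrams off: plain bigrams.
        List<String> once = shingle(tokens, false);

        // Two passes: the first keeps unigrams (as the wrapper in the posted code does),
        // the second shingles those results again with unigrams off.
        List<String> twice = shingle(shingle(tokens, true), false);

        System.out.println("one pass:   " + once);
        System.out.println("two passes: " + twice);
    }
}
```

Under these assumptions the two-pass result is the same five strings reported in the original message, while the single pass yields "resting comfortably", "comfortably _" and "_ distress". In the posted code that corresponds to keeping only one shingling stage, for example wrapping the plain StandardAnalyzer stream in a single ShingleFilter and calling `setOutputUnigrams(false)` on it.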