Either remove the ShingleAnalyzerWrapper or the additional ShingleFilter.
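
To see why shingling twice produces the repeated words, here is a toy model in plain Java. It is only a sketch and does not use Lucene: the `shingle` helper and the `DoubleShingleDemo` class are hypothetical names made up for illustration, and the model ignores position increments and the "_" filler tokens that the real ShingleFilter inserts where stop words were removed. It assumes the second pass simply sees the first pass's unigrams and bigrams as one flat token stream, which is consistent with the output reported in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DoubleShingleDemo {

    // Toy shingle pass (NOT Lucene's implementation): for each token,
    // optionally emit the token itself, then the bigram that starts at it.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("this", "is", "a", "dog");

        // One pass with unigrams off: the bigrams that were asked for.
        System.out.println(shingle(tokens, false));
        // [this is, is a, a dog]

        // Two passes: the first emits unigrams and bigrams, the second then
        // pairs up neighbours in that mixed stream.
        List<String> firstPass = shingle(tokens, true);
        System.out.println(shingle(firstPass, false));
        // [this this is, this is is, is is a, is a a, a a dog, a dog dog]
    }
}
```

The second result has the same shape as "resting resting comfortably", "resting comfortably comfortably" in the original report. Separately, note that StandardAnalyzer removes English stop words by default, which is where the "_" tokens in that report come from, so a sentence like "This is a dog" will lose most of its words before any shingling happens.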

On Wed, Apr 2, 2014 at 2:44 PM, Natalia Connolly
<natalia.v.conno...@gmail.com> wrote:
> Hi Robert,
>
>    No, I did not… I just needed the filter to stop it from outputting
> unigrams; otherwise I was getting "This", "this is", "is", "is a", and so
> on.   Is there another way I could do it?
>
>    Thank you,
>
>    Natalia
>
>
>
> On Wed, Apr 2, 2014 at 2:40 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> Did you really mean to shingle twice? ShingleAnalyzerWrapper just
>> wraps the analyzer with a ShingleFilter, and then the code wraps that
>> with another ShingleFilter.
>>
>> On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly
>> <natalia.v.conno...@gmail.com> wrote:
>> > Hello,
>> >
>> >    I am very confused about what ShingleFilter seems to be doing in
>> > Lucene 4.6.  What I would like to do is extract all possible bigrams from a
>> > sentence.  So if the sentence is "This is a dog", I want "This is", "is a",
>> > "a dog".
>> >
>> >     Here is my code:
>> >
>> >    StringTokenizer itr = new StringTokenizer(theText,"\n");
>> >    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
>> >    ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
>> >
>> >    while (itr.hasMoreTokens()) {
>> >
>> >     String theSentence = itr.nextToken();
>> >     StringReader reader = new StringReader(theSentence);
>> >     TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
>> >     ShingleFilter theFilter = new ShingleFilter(tokenStream);
>> >     theFilter.setOutputUnigrams(false);
>> >
>> >     CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
>> >
>> >     theFilter.reset();
>> >
>> >      while (theFilter.incrementToken()) {
>> >
>> >                 System.out.println(charTermAttribute.toString());
>> >
>> >      }
>> >
>> >      theFilter.end();
>> >      theFilter.close();
>> >   }
>> >
>> >
>> >    What I see in the output is this: suppose the sentence is "resting
>> > comfortably and in no distress".  I get the following output:
>> >
>> > resting resting comfortably
>> > resting comfortably comfortably
>> > comfortably comfortably _
>> > comfortably _ _ distress
>> > _ distress distress
>> >
>> >    So it looks like not only do I not get bigrams, I get spurious 3-grams
>> > by repeating words.  Could someone please help?
>> >
>> >     Thanks much,
>> >
>> >     Natalia Connolly
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
