[jira] Created: (LUCENE-2229) SimpleSpanFragmenter fails to start a new fragment

Elmer Garduno (JIRA) Wed, 20 Jan 2010 15:03:18 -0800

SimpleSpanFragmenter fails to start a new fragment
--------------------------------------------------


                 Key: LUCENE-2229
                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/highlighter
            Reporter: Elmer Garduno


SimpleSpanFragmenter fails to identify a new fragment when more than one 
stop word follows a detected span. The problem can be observed when the 
Query contains a PhraseQuery.

The cause is that the span is effectively extended to the end of the 
TokenGroup. When a span starts, the fragmenter sets {{waitForPos = 
positionSpans.get(i).end + 1;}}, and on every call it advances the counter 
with {{position += posIncAtt.getPositionIncrement();}}. When the position 
increment is greater than one, {{position}} jumps to a value greater than 
{{waitForPos}}, so {{(waitForPos == position)}} never matches.

{code:title=SimpleSpanFragmenter.java}
  public boolean isNewFragment() {
    position += posIncAtt.getPositionIncrement();

    if (waitForPos == position) {
      waitForPos = -1;
    } else if (waitForPos != -1) {
      return false;
    }

    WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());

    if (wSpanTerm != null) {
      List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();

      for (int i = 0; i < positionSpans.size(); i++) {
        if (positionSpans.get(i).start == position) {
          waitForPos = positionSpans.get(i).end + 1;
          break;
        }
      }
    }
   ...
{code}
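
The skipped comparison can be reproduced outside Lucene. The following is a minimal sketch, not the actual SimpleSpanFragmenter: the class {{WaitForPosSketch}}, the method {{finalWaitForPos}}, and the increment arrays are hypothetical and only mirror the bookkeeping quoted above. An increment of 3 models a token that follows two removed stop words.

```java
class WaitForPosSketch {

  /**
   * Replays the position bookkeeping from isNewFragment() over a list of
   * position increments (hypothetical input, one entry per token).
   * Returns the value of waitForPos after the last token:
   * -1 means the fragmenter stopped waiting, anything else means it is stuck.
   */
  static int finalWaitForPos(int[] increments, int spanStart, int spanEnd) {
    int position = -1;
    int waitForPos = -1;

    for (int inc : increments) {
      position += inc;

      if (waitForPos == position) {
        waitForPos = -1;          // reached the token right after the span
      } else if (waitForPos != -1) {
        continue;                 // still waiting: isNewFragment() returns false
      }

      if (position == spanStart) {
        waitForPos = spanEnd + 1; // wait for the first position after the span
      }
    }
    return waitForPos;
  }

  public static void main(String[] args) {
    // Span covers positions 1..2, so waitForPos becomes 3.
    int[] noGap = {1, 1, 1, 1, 1, 1};   // positions 0,1,2,3,4,5
    int[] withGap = {1, 1, 1, 3, 1, 1}; // positions 0,1,2,5,6,7 (3 is skipped)

    System.out.println("no gap:   " + finalWaitForPos(noGap, 1, 2));
    System.out.println("with gap: " + finalWaitForPos(withGap, 1, 2));
  }
}
```

With consecutive positions the wait is cleared at position 3. With the gap, position 3 never occurs, so {{waitForPos}} keeps its value and every later token is treated as part of the current fragment.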

The test case below reproduces the problem with the following Document and 
the query *"all tokens"*, which is followed in the text by the words _of a_.

{panel:title=Document}
"Attribute instances are reused for *all tokens* _of a_ document. Thus, a 
TokenStream/-Filter needs to update the appropriate Attribute(s) in 
incrementToken(). The consumer, commonly the Lucene indexer, consumes the data 
in the Attributes and then calls incrementToken() again until it retuns false, 
which indicates that the end of the stream was reached. This means that in each 
call of incrementToken() a TokenStream/-Filter can safely overwrite the data in 
the Attribute instances."
{panel}

{code:title=HighlighterTest.java}

 public void testSimpleSpanFragmenter() throws Exception {

    ...

    doSearching("\"all tokens\"");

    maxNumFragmentsRequired = 2;
    
    scorer = new QueryScorer(query, FIELD_NAME);
    highlighter = new Highlighter(this, scorer);

    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));

      String result = highlighter.getBestFragments(tokenStream, text,
          maxNumFragmentsRequired, "...");
      System.out.println("\t" + result);

    }
  }
{code}


{panel:title=Result}
are reused for <B>all</B> <B>tokens</B> of a document. Thus, a 
TokenStream/-Filter needs to update the appropriate Attribute(s) in 
incrementToken(). The consumer, commonly the Lucene indexer, consumes the data 
in the Attributes and then calls incrementToken() again until it retuns false, 
which indicates that the end of the stream was reached. This means that in each 
call of incrementToken() a TokenStream/-Filter can safely overwrite the data in 
the Attribute instances.
{panel}

{panel:title=Expected Result}
for <B>all</B> <B>tokens</B> of a document
{panel}


