[ 
https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073306#comment-15073306
 ] 

Lukhnos Liu edited comment on LUCENE-2229 at 12/29/15 2:43 AM:
---------------------------------------------------------------

Chatted with [~dsmiley] on #lucene-dev about this bug; it still exists in 
Lucene 5.x as well as on the current trunk.

I've updated [~elmer.garduno]'s patch and am attaching it here. I tested it 
against the current trunk (r1721808), and the patch fixes the issue there. I've 
been using the original patch in my day job (on Lucene 5.3.0), so the patch 
should also work in the 5.x branch.

The bug is a boundary condition that doesn't manifest itself all the time. For 
example, if you loop through the test case in the patch with different fragment 
lengths (say from 30 to 200), some of the short lengths don't show the issue 
(because the fragment may end at the first highlighted term when the requested 
length is short), but most of the lengths do. If you add a printout to the 
{{isNewFragment()}} method in question, the output looks like this when the bug 
shows up:

{code}
term: instances, position: 0, waitForPos: -1
term: reused, position: 2, waitForPos: -1
term: all, position: 4, waitForPos: -1
term: tokens, position: 5, waitForPos: -1
term: document, position: 8, waitForPos: 7  # uh oh
term: thus, position: 9, waitForPos: 7  # and all the following tokens are included in the fragment
term: tokenstream, position: 11, waitForPos: 7
term: filter, position: 12, waitForPos: 7
term: needs, position: 13, waitForPos: 7
term: update, position: 15, waitForPos: 7
...
{code}

Recall that the example document reads {{Attribute instances are reused for all 
tokens of a document. ...}}. Since "of" and "a" are stop words, the positions of 
"tokens" (5) and "document" (8) differ by 3, and positions 6 and 7 are skipped.
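The position gap can be reproduced without Lucene. Below is a minimal plain-Java sketch; the class name, stop-word list, and input text are hypothetical simplifications of what Lucene's stop filtering does, namely folding each removed stop word into the next token's position increment:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class; a simplified stand-in for Lucene's stop filtering.
public class PositionGapDemo {
    public static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("are", "for", "of", "a"));

    // Maps each surviving term to its position, accumulating the position
    // increment across removed stop words the way a stop filter does.
    public static LinkedHashMap<String, Integer> positions(String text) {
        LinkedHashMap<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        int increment = 1;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (STOP_WORDS.contains(term)) {
                increment++; // each removed stop word widens the next increment
                continue;
            }
            position += increment;
            increment = 1;
            result.put(term, position);
        }
        return result;
    }

    public static void main(String[] args) {
        positions("instances are reused for all tokens of a document")
                .forEach((term, pos) ->
                        System.out.println("term: " + term + ", position: " + pos));
    }
}
```

Running it reproduces the positions from the printout: "tokens" lands at 5 and "document" at 8, with positions 6 and 7 never emitted.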

From the printout, {{position}} overshoots {{waitForPos}} from the token 
"document" onward, so the {{waitForPos == position}} check never matches and 
{{isNewFragment()}} never gets to start a new fragment.
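To see why the exact-equality check is the culprit, here is a plain-Java sketch (a hypothetical simplification, not the actual patch): it replays the positions from the printout through the strict {{waitForPos == position}} gate and through a relaxed {{waitForPos <= position}} gate of the kind the fix uses.

```java
// Hypothetical demo class; names are not from the Lucene codebase.
public class WaitForPosDemo {
    // Positions emitted after "tokens" in the printout above; at "tokens"
    // (position 5) the span "all tokens" matched and waitForPos was set to 7.
    public static final int[] POSITIONS = {8, 9, 11, 12, 13, 15};

    // Returns the position at which waitForPos would be reset to -1,
    // or -1 if no position ever triggers the reset.
    public static int firstReset(int waitForPos, boolean relaxed, int[] positions) {
        for (int position : positions) {
            // strict gate: only an exact hit reopens fragmenting (the bug);
            // relaxed gate: any position at or past waitForPos reopens it.
            boolean hit = relaxed ? waitForPos <= position : waitForPos == position;
            if (hit) {
                return position;
            }
        }
        return -1; // never reset: every following token stays in one fragment
    }

    public static void main(String[] args) {
        System.out.println("strict  gate resets at: " + firstReset(7, false, POSITIONS)); // -1 (never)
        System.out.println("relaxed gate resets at: " + firstReset(7, true, POSITIONS));  // 8
    }
}
```

Because positions 6 and 7 are swallowed by the stop words, the strict gate can never observe {{position == 7}}, while the relaxed gate reopens fragmenting as soon as the stream moves past the span.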

After the fix, the progression changes:

{code}
term: instances, position: 0, waitForPos: -1
term: reused, position: 2, waitForPos: -1
term: all, position: 4, waitForPos: -1
term: tokens, position: 5, waitForPos: -1
term: document, position: 8, waitForPos: 7  # reset
term: thus, position: 9, waitForPos: -1
term: tokenstream, position: 11, waitForPos: -1
term: filter, position: 12, waitForPos: -1
term: needs, position: 13, waitForPos: -1
term: update, position: 15, waitForPos: -1
...
{code}

So it now gets to reset {{waitForPos}} at "document", which fixes the issue.

The updated test can be run with {{ant test 
-Dtests.class=org.apache.lucene.search.highlight.HighlighterTest}}





> SimpleSpanFragmenter fails to start a new fragment
> --------------------------------------------------
>
>                 Key: LUCENE-2229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Elmer Garduno
>            Priority: Minor
>         Attachments: LUCENE-2229.patch, LUCENE-2229.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> SimpleSpanFragmenter fails to identify a new fragment when there is more than 
> one stop word after a span is detected. This problem can be observed when the 
> Query contains a PhraseQuery.
> The problem is that the span extends toward the end of the TokenGroup. 
> Because {{waitForPos = positionSpans.get(i).end + 1;}} while {{position += 
> posIncAtt.getPositionIncrement();}} can skip several positions at once, 
> {{position}} can become greater than {{waitForPos}}, so {{(waitForPos == 
> position)}} never matches.
> {code:title=SimpleSpanFragmenter.java}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     if (waitForPos == position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>     if (wSpanTerm != null) {
>       List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>       for (int i = 0; i < positionSpans.size(); i++) {
>         if (positionSpans.get(i).start == position) {
>           waitForPos = positionSpans.get(i).end + 1;
>           break;
>         }
>       }
>     }
>    ...
> {code}
> An example is provided in the test case for the following Document and the 
> query *"all tokens"* followed by the words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a 
> TokenStream/-Filter needs to update the appropriate Attribute(s) in 
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the 
> data in the Attributes and then calls incrementToken() again until it returns 
> false, which indicates that the end of the stream was reached. This means 
> that in each call of incrementToken() a TokenStream/-Filter can safely 
> overwrite the data in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
>  public void testSimpleSpanFragmenter() throws Exception {
>     ...
>     doSearching("\"all tokens\"");
>     maxNumFragmentsRequired = 2;
>     
>     scorer = new QueryScorer(query, FIELD_NAME);
>     highlighter = new Highlighter(this, scorer);
>     for (int i = 0; i < hits.totalHits; i++) {
>       String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>       TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>       highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>       String result = highlighter.getBestFragments(tokenStream, text,
>           maxNumFragmentsRequired, "...");
>       System.out.println("\t" + result);
>     }
>   }
> {code}
> {panel:title=Result}
> are reused for <B>all</B> <B>tokens</B> of a document. Thus, a 
> TokenStream/-Filter needs to update the appropriate Attribute(s) in 
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the 
> data in the Attributes and then calls incrementToken() again until it returns 
> false, which indicates that the end of the stream was reached. This means 
> that in each call of incrementToken() a TokenStream/-Filter can safely 
> overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for <B>all</B> <B>tokens</B> of a document
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
