[
https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092349#comment-15092349
]
ASF subversion and git services commented on LUCENE-2229:
---------------------------------------------------------
Commit 1724096 from [~jpountz] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1724096 ]
LUCENE-2229: Move CHANGES entry to 5.4.1.
> SimpleSpanFragmenter fails to start a new fragment
> --------------------------------------------------
>
> Key: LUCENE-2229
> URL: https://issues.apache.org/jira/browse/LUCENE-2229
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
> Reporter: Elmer Garduno
> Assignee: David Smiley
> Priority: Minor
> Fix For: 5.5, 5.4.1
>
> Attachments: LUCENE-2229.patch, LUCENE-2229.patch, LUCENE-2229.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> SimpleSpanFragmenter fails to identify a new fragment when there is more than
> one stop word after a span is detected. This problem can be observed when the
> Query contains a PhraseQuery.
> The problem is that the span extends toward the end of the TokenGroup. Because
> {{waitForPos = positionSpans.get(i).end + 1;}} sets the wait position while
> {{position += posIncAtt.getPositionIncrement();}} can advance {{position}} by
> more than one when stop words are removed, {{position}} can jump past
> {{waitForPos}}, so the check {{(waitForPos == position)}} never matches.
> {code:title=SimpleSpanFragmenter.java}
> public boolean isNewFragment() {
>   position += posIncAtt.getPositionIncrement();
>   if (waitForPos == position) {
>     waitForPos = -1;
>   } else if (waitForPos != -1) {
>     return false;
>   }
>   WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>   if (wSpanTerm != null) {
>     List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>     for (int i = 0; i < positionSpans.size(); i++) {
>       if (positionSpans.get(i).start == position) {
>         waitForPos = positionSpans.get(i).end + 1;
>         break;
>       }
>     }
>   }
>   ...
> {code}
> An example is provided in the test case below for the following Document and
> the query *"all tokens"*, which is followed in the text by the stop words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a
> TokenStream/-Filter needs to update the appropriate Attribute(s) in
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the
> data in the Attributes and then calls incrementToken() again until it retuns
> false, which indicates that the end of the stream was reached. This means
> that in each call of incrementToken() a TokenStream/-Filter can safely
> overwrite the data in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
> public void testSimpleSpanFragmenter() throws Exception {
>   ...
>   doSearching("\"all tokens\"");
>   maxNumFragmentsRequired = 2;
>
>   scorer = new QueryScorer(query, FIELD_NAME);
>   highlighter = new Highlighter(this, scorer);
>   for (int i = 0; i < hits.totalHits; i++) {
>     String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>     TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>     highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>     String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired, "...");
>     System.out.println("\t" + result);
>   }
> }
> {code}
> {panel:title=Result}
> are reused for <B>all</B> <B>tokens</B> of a document. Thus, a
> TokenStream/-Filter needs to update the appropriate Attribute(s) in
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the
> data in the Attributes and then calls incrementToken() again until it retuns
> false, which indicates that the end of the stream was reached. This means
> that in each call of incrementToken() a TokenStream/-Filter can safely
> overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for <B>all</B> <B>tokens</B> of a document
> {panel}
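
The position bookkeeping described above can be reproduced outside Lucene. Below is a minimal standalone sketch (the class name, method, and increment values are hypothetical, not Lucene code): positions advance by each token's position increment, so two removed stop words make the position jump by more than one and skip past {{waitForPos}} when compared with {{==}}. A {{<=}} comparison, one possible fix (the actual committed patch may differ), still clears the wait and lets a new fragment start.

```java
// Standalone simulation of SimpleSpanFragmenter's waitForPos bookkeeping.
// Increments model "for all tokens of a document" with the stop words
// "of" and "a" removed: the token after "tokens" arrives with increment 3.
public class WaitForPosDemo {
  static final int[] INCREMENTS = {1, 1, 1, 3};
  static final int SPAN_END = 2; // the span "all tokens" ends at position 2

  // Returns true if the wait set after the span was ever cleared,
  // i.e. a new fragment would be allowed to start.
  static boolean spanCloses(boolean useLessOrEqual) {
    int position = -1;
    int waitForPos = -1;
    for (int inc : INCREMENTS) {
      position += inc; // may jump by >1 across removed stop words
      boolean clear = useLessOrEqual
          ? (waitForPos != -1 && waitForPos <= position)
          : (waitForPos == position);
      if (clear) {
        waitForPos = -1;
      }
      if (position == SPAN_END) {
        waitForPos = position + 1; // wait for the position after the span
      }
    }
    return waitForPos == -1;
  }

  public static void main(String[] args) {
    // position reaches 5, skipping the awaited position 3 entirely
    System.out.println("== clears wait: " + spanCloses(false)); // false
    System.out.println("<= clears wait: " + spanCloses(true));  // true
  }
}
```

With {{==}} the wait position 3 is never observed (positions go 0, 1, 2, 5), so {{isNewFragment()}} keeps returning false and the fragment extends to the end of the TokenGroup, matching the overly long Result panel above.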
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]