[jira] [Commented] (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances

Scott Stults (JIRA) Wed, 07 Oct 2015 08:42:14 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947050#comment-14947050
 ]


Scott Stults commented on LUCENE-2287:
--------------------------------------

LUCENE-5455 has a few tests that should be added here once this patch is 
cleaned up. 

There are a few hurdles in cleaning this up, though. The first is that this 
patch was based on a *really* old version, and I can't seem to find anything in 
SVN or git older than 3.1. The second is that the Spans API has changed quite a 
bit since then.

By the way, I've tried the unit tests in both issues and they still fail in 
5.3+. The issue seems to be in 
WeightedSpanTermExtractor.extractWeightedSpanTerms(). It first builds a list of 
all position spans, and then it adds every one of those position spans to each 
term's entry in the term map, irrespective of whether that term was actually 
used in that position span. Mike's patch addresses this by keeping a separate 
list of position spans per term.
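
To make the difference concrete, here is a simplified, self-contained model of 
the two behaviors. It is not Lucene code: the class SpanHighlightSketch, its 
Range type, and the methods sharedSpans, perTermSpans, and highlight are 
hypothetical names, and the position values are worked out by hand from the 
test text in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model, not the Lucene highlighter itself.
public class SpanHighlightSketch {

    /** Inclusive [start, end] range of token positions; stands in for a position span. */
    static final class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
        boolean contains(int pos) { return pos >= start && pos <= end; }
    }

    /** Shared-list behavior: every query term receives every matched span. */
    static Map<String, List<Range>> sharedSpans(List<String> terms, List<Range> allSpans) {
        Map<String, List<Range>> byTerm = new HashMap<>();
        for (String term : terms) {
            byTerm.put(term, new ArrayList<>(allSpans));
        }
        return byTerm;
    }

    /** Per-term behavior: a term keeps only the positions where it took part in a match. */
    static Map<String, List<Range>> perTermSpans(Map<String, int[]> matchedPositions) {
        Map<String, List<Range>> byTerm = new HashMap<>();
        for (Map.Entry<String, int[]> e : matchedPositions.entrySet()) {
            List<Range> ranges = new ArrayList<>();
            for (int pos : e.getValue()) {
                ranges.add(new Range(pos, pos));
            }
            byTerm.put(e.getKey(), ranges);
        }
        return byTerm;
    }

    /** Wraps a token in <B> tags when its position lies inside a range recorded for that term. */
    static String highlight(String text, Map<String, List<Range>> byTerm) {
        String[] tokens = text.split(" ");
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Range> ranges = byTerm.get(tokens[pos].toLowerCase(Locale.ROOT));
            boolean hit = false;
            if (ranges != null) {
                for (Range r : ranges) {
                    hit |= r.contains(pos);
                }
            }
            if (pos > 0) {
                out.append(' ');
            }
            out.append(hit ? "<B>" + tokens[pos] + "</B>" : tokens[pos]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";

        // The outer match runs from position 1 (Lucene) through position 10 (Hadoop).
        List<Range> allSpans = Arrays.asList(new Range(1, 10));
        List<String> terms = Arrays.asList("lucene", "doug", "hadoop");
        // Shared list: the "Lucene" at position 8 lies inside [1, 10], so it is highlighted too.
        System.out.println(highlight(text, sharedSpans(terms, allSpans)));

        // Per-term lists: only the positions that actually took part in the match.
        Map<String, int[]> matched = new HashMap<>();
        matched.put("lucene", new int[] {1});
        matched.put("doug", new int[] {5});
        matched.put("hadoop", new int[] {10});
        System.out.println(highlight(text, perTermSpans(matched)));
    }
}
```

With the shared list the second "Lucene" comes out wrapped in <B> tags; with 
per-term lists the output matches the expected string from the test.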

The thing that's *not* fixed by the patch is the notion of when to stop 
recursing into the spans. I tried several methods of inspecting and classifying 
the spans, but I either ended up with too many positions (resulting in too many 
term highlights) or too few. 

[~romseygeek], why is this so hard? Can't we just use the same methods the 
searcher uses? Maybe create a new collector and re-search the incoming doc?
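
As a rough illustration of what "use the same matching logic" could mean, here 
is a toy matcher in which every match carries the leaf positions that produced 
it, so the highlighter would never have to guess how deep to recurse. This is 
only a sketch under loud assumptions: NestedNearSketch, Match, Node, term, 
orderedNear, and highlightPositions are hypothetical names, the brute-force 
enumeration is not how Lucene's NearSpansOrdered works, and slop is modeled 
simply as the number of tokens between adjacent clauses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical toy matcher, not Lucene's Spans implementation.
public class NestedNearSketch {

    /** One match: token range [start, end), the leaf positions behind it, and the gap used so far. */
    static final class Match {
        final int start, end, gap;
        final List<Integer> leaves;
        Match(int start, int end, List<Integer> leaves, int gap) {
            this.start = start; this.end = end; this.leaves = leaves; this.gap = gap;
        }
    }

    interface Node {
        List<Match> matches(String[] tokens);
    }

    /** Leaf clause: matches every position holding the given (lowercase) term. */
    static Node term(String term) {
        return tokens -> {
            List<Match> out = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                if (tokens[i].toLowerCase(Locale.ROOT).equals(term)) {
                    out.add(new Match(i, i + 1, Collections.singletonList(i), 0));
                }
            }
            return out;
        };
    }

    /** Ordered near: clauses match left to right with at most `slop` tokens between them in total. */
    static Node orderedNear(int slop, Node... children) {
        return tokens -> {
            // Start from the first clause; gaps inside a nested clause do not count here.
            List<Match> partial = new ArrayList<>();
            for (Match m : children[0].matches(tokens)) {
                partial.add(new Match(m.start, m.end, m.leaves, 0));
            }
            for (int c = 1; c < children.length; c++) {
                List<Match> next = new ArrayList<>();
                for (Match left : partial) {
                    for (Match right : children[c].matches(tokens)) {
                        if (right.start < left.end) {
                            continue; // out of order or overlapping
                        }
                        int gap = left.gap + (right.start - left.end);
                        if (gap > slop) {
                            continue; // too far apart
                        }
                        List<Integer> leaves = new ArrayList<>(left.leaves);
                        leaves.addAll(right.leaves);
                        next.add(new Match(left.start, right.end, leaves, gap));
                    }
                }
                partial = next;
            }
            return partial;
        };
    }

    /** Positions to highlight: the union of leaf positions over all top-level matches. */
    static TreeSet<Integer> highlightPositions(Node query, String[] tokens) {
        TreeSet<Integer> positions = new TreeSet<>();
        for (Match m : query.matches(tokens)) {
            positions.addAll(m.leaves);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] tokens = "The Lucene was made by Doug Cutting and Lucene great Hadoop was".split(" ");
        Node query = orderedNear(4,
            orderedNear(5, term("lucene"), term("doug")),
            term("hadoop"));
        System.out.println(highlightPositions(query, tokens)); // prints [1, 5, 10]
    }
}
```

In this model the second "Lucene" (position 8) never appears in the result, 
because no "doug" follows it and so it never takes part in an inner match.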

> Unexpected terms are highlighted within nested SpanQuery instances
> ------------------------------------------------------------------
>
>                 Key: LUCENE-2287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2287
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 2.9.1
>         Environment: Linux, Solaris, Windows
>            Reporter: Michael Goddard
>            Priority: Minor
>         Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, 
> LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I haven't yet been able to resolve why I'm seeing spurious highlighting in 
> nested SpanQuery instances.  Briefly, the issue is illustrated by the second 
> instance of "Lucene" being highlighted in the test below, when it doesn't 
> satisfy the inner span.  There's been some discussion about this on the 
> java-dev list, and I'm opening this issue now because I have made some 
> initial progress on this.
> This new test, added to the HighlighterTest class in lucene_2_9_1, 
> illustrates this:
> /*
>  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
>  */
> public void testHighlightingNestedSpans2() throws Exception {
>   String theText = "The Lucene was made by Doug Cutting and Lucene great "
>     + "Hadoop was"; // Problem
>   //String theText = "The Lucene was made by Doug Cutting and the great "
>   //  + "Hadoop was"; // Works okay
>   String fieldName = "SOME_FIELD_NAME";
>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
>     new SpanTermQuery(new Term(fieldName, "lucene")),
>     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
>     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and "
>     + "Lucene great <B>Hadoop</B> was";
>   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and "
>   //  + "the great <B>Hadoop</B> was";
>   String observed = highlightField(query, fieldName, theText);
>   System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + 
> observed);
>   assertEquals("Why is that second instance of the term \"Lucene\" "
>     + "highlighted?", expected, observed);
> }
> Is this an issue that's arisen before?  I've been reading through the source 
> to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and 
> NearSpansOrdered, but haven't found the solution yet.  Initially, I thought 
> that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should 
> be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't 
> get me too far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

