[
https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947050#comment-14947050
]
Scott Stults commented on LUCENE-2287:
--------------------------------------
LUCENE-5455 has a few tests that should be added here once this patch is
cleaned up.
There are a few hurdles in cleaning this up, though. The first is that this
patch was based on a *really* old version, and I can't seem to find anything in
SVN or git older than 3.1. The second is that the Spans API is quite a bit
different now.
By the way, I've tried the unit tests in both issues and they still fail in
5.3+. The issue seems to be in
WeightedSpanTermExtractor.extractWeightedSpanTerms(). It first builds a list of
all position spans, and then it adds all of those position spans to a map of
the term irrespective of whether that term was used in that position span.
Mike's patch addresses this by keeping a separate list of position spans per
term.
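To make the difference concrete, here is a toy sketch in plain Java. It does
not use the Lucene API and is not the code from the patch: MatchedSpan,
sharedPositions, perTermPositions and issueExampleSpans are hypothetical names,
and the tagging of which terms belong to which span is an assumption made for
the example. Positions are word offsets in the test sentence from the issue
(Lucene=1, Doug=5, Lucene=8, Hadoop=10).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SpanPositionSketch {

    /** Hypothetical stand-in for a position span: [start, end) plus the terms assumed to have used it. */
    static final class MatchedSpan {
        final int start;
        final int end;
        final Set<String> terms;

        MatchedSpan(int start, int end, String... terms) {
            this.start = start;
            this.end = end;
            this.terms = new HashSet<>(Arrays.asList(terms));
        }

        boolean covers(int position) {
            return position >= start && position < end;
        }
    }

    /** Behaviour described above: every collected span is attached to every query term. */
    static Map<String, List<MatchedSpan>> sharedPositions(Set<String> queryTerms, List<MatchedSpan> spans) {
        Map<String, List<MatchedSpan>> byTerm = new HashMap<>();
        for (String term : queryTerms) {
            byTerm.put(term, new ArrayList<>(spans));
        }
        return byTerm;
    }

    /** Per-term variant: a term only receives the spans it was tagged as using. */
    static Map<String, List<MatchedSpan>> perTermPositions(Set<String> queryTerms, List<MatchedSpan> spans) {
        Map<String, List<MatchedSpan>> byTerm = new HashMap<>();
        for (String term : queryTerms) {
            List<MatchedSpan> own = new ArrayList<>();
            for (MatchedSpan span : spans) {
                if (span.terms.contains(term)) {
                    own.add(span);
                }
            }
            byTerm.put(term, own);
        }
        return byTerm;
    }

    /** A term occurrence is highlighted if any span attached to that term covers its position. */
    static boolean isHighlighted(Map<String, List<MatchedSpan>> byTerm, String term, int position) {
        List<MatchedSpan> spans = byTerm.get(term);
        if (spans == null) {
            return false;
        }
        for (MatchedSpan span : spans) {
            if (span.covers(position)) {
                return true;
            }
        }
        return false;
    }

    /** Assumed spans for the issue's sentence: inner (lucene..doug) and outer (inner..hadoop). */
    static List<MatchedSpan> issueExampleSpans() {
        List<MatchedSpan> spans = new ArrayList<>();
        spans.add(new MatchedSpan(1, 6, "lucene", "doug")); // inner SpanNearQuery match
        spans.add(new MatchedSpan(1, 11, "hadoop"));        // outer match, tagged with the term it adds
        return spans;
    }
}
```

With the shared map, the outer span [1, 11) is attached to "lucene" and so
covers the second "Lucene" at position 8. With per-term lists, "lucene" only
keeps [1, 6), so position 8 is left alone. Deciding which terms really belong
to which span is the part this sketch assumes away.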
The thing that's *not* fixed by the patch is the notion of when to stop
recursing into the spans. I tried several methods of inspecting and classifying
the spans but I either end up with too many positions (resulting in too many
term highlights) or too few.
[~romseygeek], why is this so hard? Can't we just use the same methods the
searcher uses? Maybe create a new collector and re-search the incoming doc?
> Unexpected terms are highlighted within nested SpanQuery instances
> ------------------------------------------------------------------
>
> Key: LUCENE-2287
> URL: https://issues.apache.org/jira/browse/LUCENE-2287
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 2.9.1
> Environment: Linux, Solaris, Windows
> Reporter: Michael Goddard
> Priority: Minor
> Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch,
> LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> I haven't yet been able to resolve why I'm seeing spurious highlighting in
> nested SpanQuery instances. Briefly, the issue is illustrated by the second
> instance of "Lucene" being highlighted in the test below, when it doesn't
> satisfy the inner span. There's been some discussion about this on the
> java-dev list, and I'm opening this issue now because I have made some
> initial progress on this.
> This new test, added to the HighlighterTest class in lucene_2_9_1,
> illustrates this:
> /*
>  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
>  */
> public void testHighlightingNestedSpans2() throws Exception {
>   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
>   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
>   String fieldName = "SOME_FIELD_NAME";
>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
>       new SpanTermQuery(new Term(fieldName, "lucene")),
>       new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
>       new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
>   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
>   String observed = highlightField(query, fieldName, theText);
>   System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);
>   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
> }
> Is this an issue that's arisen before? I've been reading through the source
> to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and
> NearSpansOrdered, but haven't found the solution yet. Initially, I thought
> that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should
> be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't
> get me too far.