Unexpected terms are highlighted within nested SpanQuery instances
------------------------------------------------------------------

                 Key: LUCENE-2287
                 URL: https://issues.apache.org/jira/browse/LUCENE-2287
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
    Affects Versions: 2.9.1
         Environment: Linux, Solaris, Windows
            Reporter: Michael Goddard
            Priority: Minor


I haven't yet been able to resolve why I'm seeing spurious highlighting in 
nested SpanQuery instances.  Briefly, the issue is illustrated by the second 
instance of "Lucene" being highlighted in the test below, when it doesn't 
satisfy the inner span.  There's been some discussion about this on the 
java-dev list, and I'm opening this issue now because I have made some initial 
progress on this.

The following new test, added to the HighlighterTest class in lucene_2_9_1, 
illustrates the problem:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {

  String theText =
    "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText =
  //  "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

  String fieldName = "SOME_FIELD_NAME";

  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(fieldName, "lucene")),
    new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);

  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
    new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

  String expected =
    "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected =
  //  "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";

  String observed = highlightField(query, fieldName, theText);
  // Print both strings, each wrapped in quotes, for easy comparison
  System.out.println("Expected: \"" + expected + "\"\n"
    + "Observed: \"" + observed + "\"");

  assertEquals(
    "Why is that second instance of the term \"Lucene\" highlighted?",
    expected, observed);
}

Is this an issue that's arisen before?  I've been reading through the source to 
QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and 
NearSpansOrdered, but haven't found the solution yet.  Initially, I thought 
that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be 
called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me 
too far.
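
One possible explanation, offered only as a hypothesis: if the highlighter 
records just the start and end positions of the outermost matching span and 
then marks every query term that falls inside that range, the second "Lucene" 
would be highlighted even though it takes no part in the inner span.  The 
sketch below models that assumption in plain Java.  It does not use any Lucene 
API; the class name, the highlight method, and the hard-coded span positions 
(1 to 11, end exclusive, counting whitespace-separated tokens from 0) are all 
hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SpanRangeSketch {

  // Hypothetical helper, NOT part of Lucene: wraps every query term whose
  // token position lies in [spanStart, spanEnd) in <B>...</B>.
  static String highlight(String text, Set<String> queryTerms,
                          int spanStart, int spanEnd) {
    String[] tokens = text.split(" ");
    StringBuilder out = new StringBuilder();
    for (int pos = 0; pos < tokens.length; pos++) {
      if (pos > 0) {
        out.append(' ');
      }
      boolean inSpan = pos >= spanStart && pos < spanEnd;
      boolean isQueryTerm =
          queryTerms.contains(tokens[pos].toLowerCase(Locale.ROOT));
      if (inSpan && isQueryTerm) {
        out.append("<B>").append(tokens[pos]).append("</B>");
      } else {
        out.append(tokens[pos]);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Set<String> terms =
        new HashSet<String>(Arrays.asList("lucene", "doug", "hadoop"));
    String text =
        "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
    // Assumed outer span: first "Lucene" (position 1) through "Hadoop"
    // (position 10), so the end-exclusive range is [1, 11).
    System.out.println(highlight(text, terms, 1, 11));
  }
}
```

Under this assumption, the "Lucene" at position 8 is wrapped in <B> tags 
simply because it sits between the first "Lucene" and "Hadoop", which matches 
the observed output.  Whether WeightedSpanTermExtractor actually behaves this 
way still needs to be confirmed against the source.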

