Hey Michael - this is currently just a limitation of the span highlighter. It does a bit of fudging when deciding which positions are good ones to highlight: if a term from the text is found within the span of a SpanQuery that contains it (no matter how deeply nested), the highlighter guesses that the term should be highlighted. It has to guess because we don't have the actual positions of each term, just the start and end positions of the span. In almost all cases this works as you would expect, but when nesting spans like this, you can get spurious highlights within the overall span.
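To make the fudging concrete, here is a rough, self-contained sketch of the guess. It is not the highlighter's code and uses no Lucene classes. The class name SpanFudgeSketch and the method highlight are made up for illustration, tokens are split on single spaces (an assumption, the real test goes through an analyzer), and the span bounds [1, 11) were worked out by hand for the text in the quoted test: the first "Lucene" sits at position 1 and "Hadoop" at position 10.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SpanFudgeSketch {

    // Hypothetical helper: marks every token that is a query term and whose
    // position falls inside [spanStart, spanEnd). Only the outer span bounds
    // are known, so any matching term inside them gets highlighted.
    public static String highlight(String text, Set<String> queryTerms,
                                   int spanStart, int spanEnd) {
        String[] tokens = text.split(" "); // assumption: whitespace tokenization
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < tokens.length; pos++) {
            if (pos > 0) {
                out.append(' ');
            }
            boolean inSpan = pos >= spanStart && pos < spanEnd;
            boolean isQueryTerm = queryTerms.contains(tokens[pos].toLowerCase(Locale.ROOT));
            if (inSpan && isQueryTerm) {
                out.append("<B>").append(tokens[pos]).append("</B>");
            } else {
                out.append(tokens[pos]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
        Set<String> terms = new HashSet<>(Arrays.asList("lucene", "doug", "hadoop"));
        // Outer span assumed to cover positions 1 (first "Lucene") through 10 ("Hadoop").
        System.out.println(highlight(text, terms, 1, 11));
        // prints: The <B>Lucene</B> was made by <B>Doug</B> Cutting and <B>Lucene</B> great <B>Hadoop</B> was
    }
}
```

The second "Lucene" (position 8) is inside the overall span and is a query term, so the guess marks it even though it is not the occurrence that made the inner SpanNear match.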
So your idea that we should recurse into the span is on the right track, but it gets fairly complicated quickly. Consider SpanNear(SpanNear(mark, miller, 3), SpanTerm(lucene), 4): if we recurse in and grab the first SpanNear(mark, miller, 3), we can correctly highlight that, but then we would handle lucene by itself, so all lucene terms would be hit rather than just the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, and SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm, and uh, it gets hard real fast. Hence the fuzziness that goes on now.

There may be something we can do to improve things in the future, but it's kind of an accepted limitation at the moment - probably something we should add some doc about.

- Mark

Goddard, Michael J. wrote:
>
> Hello,
>
> I initially posted a version of this question to java-user, but think
> it's more of a java-dev question. I haven't yet been able to resolve
> why I'm seeing spurious highlighting in nested SpanQuery instances.
> To illustrate this, I added the code below to the HighlighterTest
> class in lucene_2_9_1:
>
> /*
>  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
>  */
> public void testHighlightingNestedSpans2() throws Exception {
>
>   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
>   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
>
>   String fieldName = "SOME_FIELD_NAME";
>
>   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
>       new SpanTermQuery(new Term(fieldName, "lucene")),
>       new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
>
>   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
>       new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
>
>   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
>   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
>
>   String observed = highlightField(query, fieldName, theText);
>   System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);
>
>   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
> }
>
> Is this an issue that's arisen before? I've been reading through the
> source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor,
> Spans, and NearSpansOrdered, but haven't found the solution yet.
> Initially, I thought that the extractWeightedSpanTerms method in
> WeightedSpanTermExtractor should be called on each clause of a
> SpanNearQuery or SpanOrQuery, but that didn't get me too far.
>
> Any suggestions are welcome.
>
> Thanks.
>
> Mike
>

--
- Mark

http://www.lucidimagination.com