I’m working on making highlighting both accurate and fast. By “accurate”, I mean the highlights need to reflect what actually matched, given the query and the various possible query types (including SpanQueries and MultiTermQueries, and obviously phrase queries and the usual suspects). The fastest highlighter we’ve got in Lucene is the PostingsHighlighter, but it throws out any positional nature of the query and can highlight less accurately than the other two highlighters. The most accurate is the default highlighter, although I can see some simplifications it makes that could lead to inaccuracies.
The default highlighter’s WeightedSpanTermExtractor (WSTE) is interesting: it builds a MemoryIndex by re-analyzing the text and then, in effect, executes the query against this mini index. A recent experiment of mine was to have the MemoryIndex essentially wrap the Terms from term vectors. It works and saves memory. However, at least for large docs (which I’m optimizing for), and assuming you index term vectors of course, the real performance hit is in un-inverting the term vector into a TokenStream in TokenSources, which includes sorting the thousands of tokens.
But with my attention now on the PostingsHighlighter (because it’s the fastest, and offsets in postings are way cheaper than term vectors), I believe WeightedSpanTermExtractor could simply use Lucene’s actual IndexReader, no? It seems so obvious to me now that I wonder why it wasn’t done this way in the first place: all WSTE has to do is advance() to the document being highlighted for the applicable terms. Am I overlooking something?
WeightedSpanTermExtractor is somewhat accurate, but my reading of its source shows it takes shortcuts I’d like to eliminate. For example, if the query is “(A && B) || (C && D)” and the document doesn’t have ‘D’, then it should ideally NOT highlight ‘C’ in this document, just ‘A’ and ‘B’. I think I can solve that using Scorer.getChildren() to see which scorers (and thus queries) actually matched.
Another example is that it views SpanQueries at the top level only and records the entire span for all the terms the query is composed of. So if you had a couple of phrase SpanQueries (actually ordered 0-slop SpanNearQueries) joined by a SpanNearQuery requiring them to be within ~50 positions of each other, I believe it would highlight any other occurrence of the words involved that falls in between the sub-SpanQueries. This looks hard to solve, but I think for starters SpanScorer needs a getter for the Spans instance, and furthermore Spans needs getChildSpans(), just as Scorers expose child scorers.
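To make the two shortcuts above concrete, here is a minimal plain-Java sketch of the behavior I’m after. It is only a toy model and doesn’t touch Lucene at all: the class HighlightSketch, its method names, the OR-of-ANDs query representation and the [start, end) position pairs are all made up for illustration, and a real implementation would presumably walk child scorers and child spans instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: none of these types or methods are Lucene classes. */
final class HighlightSketch {

    /**
     * Boolean case. The query is modeled as an OR of AND-clauses, e.g.
     * (A && B) || (C && D) is [[A, B], [C, D]]. Only terms from clauses
     * that matched in full are returned for highlighting.
     */
    static Set<String> termsFromMatchedClauses(List<List<String>> orOfAnds, Set<String> docTerms) {
        Set<String> toHighlight = new TreeSet<>();
        for (List<String> andClause : orOfAnds) {
            if (docTerms.containsAll(andClause)) {
                toHighlight.addAll(andClause);
            }
        }
        return toHighlight;
    }

    /** Span shortcut: highlight every query term found anywhere inside the outer span [start, end). */
    static List<Integer> wholeSpanPositions(String[] tokens, Set<String> queryTerms, int start, int end) {
        List<Integer> hits = new ArrayList<>();
        for (int pos = start; pos < end; pos++) {
            if (queryTerms.contains(tokens[pos])) {
                hits.add(pos);
            }
        }
        return hits;
    }

    /** Child-span variant: highlight only positions covered by one of the sub-spans [start, end). */
    static List<Integer> childSpanPositions(int[][] childSpans) {
        List<Integer> hits = new ArrayList<>();
        for (int[] child : childSpans) {
            for (int pos = child[0]; pos < child[1]; pos++) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // (A && B) || (C && D) against a document that lacks D.
        List<List<String>> query = Arrays.asList(Arrays.asList("A", "B"), Arrays.asList("C", "D"));
        Set<String> doc = new HashSet<>(Arrays.asList("A", "B", "C"));
        System.out.println(termsFromMatchedClauses(query, doc)); // [A, B]

        // "quick fox" near "lazy dog", with a stray "fox" at position 5 in between.
        String[] tokens = "quick fox jumped over the fox then the lazy dog".split(" ");
        Set<String> terms = new HashSet<>(Arrays.asList("quick", "fox", "lazy", "dog"));
        System.out.println(wholeSpanPositions(tokens, terms, 0, tokens.length)); // [0, 1, 5, 8, 9]
        System.out.println(childSpanPositions(new int[][] {{0, 2}, {8, 10}}));   // [0, 1, 8, 9]
    }
}
```

In the second case, position 5 is the stray “fox” that sits between the two phrases: the whole-span approach picks it up, while the child-span approach does not.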
I could see myself relaxing this requirement because of its complexity and simply highlighting the entire span, even if that could be a big highlight. Perhaps the “Nuke Spans” effort might make this all much easier, but I haven’t looked yet because it’s not done. It’s encouraging to see Alan making recent progress there.
Any thoughts about any of this, guys?
p.s. When I’m done, I expect to have no problem getting open-source permission from the sponsor commissioning this effort.
~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
