+1 for a "completely accurate" (each snippet shown matches the query) and fast highlighter, but it's a real challenge because you need a clean way to recursively iterate all positions for any query, even a non-positional one (which is what LUCENE-2878 will give us). To properly handle your (+A +B) (+C +D) example, you'd need BooleanQuery to participate in enumerating the positions...
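
To make the (+A +B) (+C +D) case concrete, here is a minimal sketch of a query tree in which the boolean nodes themselves take part in enumerating positions. It is a toy model only: `Node`, `term`, `allOf` and `anyOf` are hypothetical names invented for this illustration, and nothing here is Lucene's API or what LUCENE-2878 actually implements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionEnumerationSketch {

    /** Hypothetical node: term -> positions if the node matches the doc, else null. */
    public interface Node {
        Map<String, List<Integer>> matches(List<String> docTokens);
    }

    /** Leaf: matches if the term occurs, and reports every position of it. */
    public static Node term(String t) {
        return doc -> {
            List<Integer> positions = new ArrayList<>();
            for (int i = 0; i < doc.size(); i++) {
                if (doc.get(i).equals(t)) positions.add(i);
            }
            if (positions.isEmpty()) return null;
            Map<String, List<Integer>> out = new TreeMap<>();
            out.put(t, positions);
            return out;
        };
    }

    /** Required clauses (+X +Y): if one child is missing, nothing is reported. */
    public static Node allOf(Node... children) {
        return doc -> {
            Map<String, List<Integer>> out = new TreeMap<>();
            for (Node c : children) {
                Map<String, List<Integer>> m = c.matches(doc);
                if (m == null) return null;
                out.putAll(m);
            }
            return out;
        };
    }

    /** Optional clauses: only the children that actually matched contribute. */
    public static Node anyOf(Node... children) {
        return doc -> {
            Map<String, List<Integer>> out = new TreeMap<>();
            boolean matched = false;
            for (Node c : children) {
                Map<String, List<Integer>> m = c.matches(doc);
                if (m != null) {
                    matched = true;
                    out.putAll(m);
                }
            }
            return matched ? out : null;
        };
    }

    public static void main(String[] args) {
        // (+A +B) (+C +D) against a document that contains C but not D.
        Node query = anyOf(allOf(term("A"), term("B")), allOf(term("C"), term("D")));
        List<String> doc = Arrays.asList("A", "x", "C", "B", "C");
        System.out.println(query.matches(doc)); // prints {A=[0], B=[3]}
    }
}
```

Because the (+C +D) clause fails as a whole, the occurrences of C at positions 2 and 4 are never reported, so a highlighter consuming this output would mark only A and B.
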

Yes, I think WSTE could just pull from the postings.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Oct 10, 2014 at 12:38 AM, [email protected] <[email protected]> wrote:
> I’m working on making highlighting both accurate and fast. By “accurate”,
> I mean the highlights need to accurately reflect a match given the query
> and various possible query types (to include SpanQueries and
> MultiTermQueries and obviously phrase queries and the usual suspects).
> The fastest highlighter we’ve got in Lucene is the PostingsHighlighter,
> but it throws out any positional nature in the query and can highlight
> more inaccurately than the other two highlighters. The most accurate is
> the default highlighter, although I can see some simplifications it makes
> that could lead to inaccuracies.
>
> The default highlighter’s “WeightedSpanTermExtractor” is interesting: it
> uses a MemoryIndex built from re-analyzing the text, and it executes the
> query against this mini index; kind of. A recent experiment I did was to
> have the MemoryIndex essentially wrap the “Terms” from term vectors. It
> works and saves memory, although, at least for large docs (which I’m
> optimizing for), the real performance hit is in un-inverting the
> TokenStream in TokenSources, including sorting the thousands of tokens --
> assuming you index term vectors of course. But with my attention now on
> the PostingsHighlighter (because it’s the fastest and offsets are way
> cheaper than term vectors), I believe WeightedSpanTermExtractor could
> simply use Lucene’s actual IndexReader, no? It seems so obvious to me now
> that I wonder why it wasn’t done this way in the first place; all WSTE
> has to do is advance() to the document being highlighted for applicable
> terms. Am I overlooking something?
>
> WeightedSpanTermExtractor is somewhat accurate but my reading of its
> source shows it takes short-cuts I’d like to eliminate. For example, if
> the query is “(A && B) || (C && D)” and the document doesn’t have ‘D’,
> then it should ideally NOT highlight ‘C’ in this document, just ‘A’ and
> ‘B’. I think I can solve that using Scorers.getChildScorers to see which
> scorers (and thus queries) actually matched. Another example is that it
> views SpanQueries at the top level only and records the entire span for
> all terms it is comprised of. So if you had a couple Phrase SpanQueries
> (actually ordered 0-slop SpanNearQueries) joined by a SpanNearQuery to be
> within ~50 positions of each other, I believe it would highlight any
> other occurrence of the words involved in-between the sub-SpanQueries.
> This looks hard to solve but I think for starters, SpanScorer needs a
> getter for the Spans instance, and furthermore Spans needs
> getChildSpans() just as Scorers expose child scorers. I could see myself
> relaxing this requirement because of its complexity and simply
> highlighting the entire span, even if it could be a big highlight.
>
> Perhaps the “Nuke Spans” effort might make this all much easier but I
> haven’t looked yet because that’s not done yet. It’s encouraging to see
> Alan making recent progress there.
>
> Any thoughts about any of this, guys?
>
> p.s. When I’m done, I expect to have no problem getting open-source
> permission from the sponsor commissioning this effort.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
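
The advance()-based lookup described in the quoted message (position each applicable term's postings on the document being highlighted, then read its offsets) can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyPostings` and `offsetsFor` are hypothetical names made up for the illustration, the postings are plain in-memory arrays, and none of it is Lucene's actual postings API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingsAdvanceSketch {

    /** Hypothetical stand-in for one term's postings; not Lucene's API. */
    public static final class ToyPostings {
        public static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docIds;      // sorted ascending
        private final int[][][] offsets; // offsets[i] holds {start, end} pairs for docIds[i]
        private int idx = 0;

        public ToyPostings(int[] docIds, int[][][] offsets) {
            this.docIds = docIds;
            this.offsets = offsets;
        }

        /** Moves forward to the first doc >= target and returns it, or NO_MORE_DOCS. */
        public int advance(int target) {
            while (idx < docIds.length && docIds[idx] < target) idx++;
            return idx < docIds.length ? docIds[idx] : NO_MORE_DOCS;
        }

        /** Offsets of the term in the doc the iterator is currently positioned on. */
        public int[][] currentOffsets() {
            return offsets[idx];
        }
    }

    /** Collects {start, end} offsets of every term that occurs in docId, sorted by start. */
    public static List<int[]> offsetsFor(int docId, List<ToyPostings> termPostings) {
        List<int[]> out = new ArrayList<>();
        for (ToyPostings p : termPostings) {
            // Only terms whose postings land exactly on docId occur in that document.
            if (p.advance(docId) == docId) {
                out.addAll(Arrays.asList(p.currentOffsets()));
            }
        }
        out.sort((a, b) -> Integer.compare(a[0], b[0]));
        return out;
    }

    public static void main(String[] args) {
        ToyPostings first = new ToyPostings(
            new int[] {2, 7}, new int[][][] {{{0, 6}}, {{10, 16}, {40, 46}}});
        ToyPostings second = new ToyPostings(
            new int[] {3, 7, 9}, new int[][][] {{{5, 9}}, {{20, 24}}, {{1, 5}}});
        for (int[] o : offsetsFor(7, Arrays.asList(first, second))) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

No re-analysis or un-inverting of a token stream is involved in this model; the cost is one forward skip per query term. It only shows where offsets would come from and does not address which of those terms should be highlighted for positional queries.
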
