Spot on, Walter. “Trust in the engine” is precisely how it was put to me for a technical user community that I’m aiming this at. When they see “wrong” highlighting, they lose trust. One might argue it’s a matter of educating the user, but I think it’s a reasonable requirement (a reasonable thing to want of Lucene).

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Fri, Oct 10, 2014 at 12:55 PM, Walter Underwood <[email protected]> wrote:

> I think of snippets and highlighting as explaining to the end user why the engine decided this was relevant. This tends to increase the user’s trust in the engine even when the results are not relevant.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/
>
> On Oct 10, 2014, at 9:37 AM, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
> I’m confused how inaccuracy is a feature, but nevertheless I appreciate that the postings highlighter as-is is good enough for most users. Thanks for your awesome work on this highlighter, by the way!
>
> The problem here is two different opinions about what highlighting should look like. What is always wanted by most “technical” people is **not** “highlighting” like “showing where the search terms match in a specific document to allow the user himself to ‘relevance test’ a specific result”, instead technical people want to have “query debugging”: exactly showing why a query matches. But this is not what highlighting was made for *(especially not postings highlighter!).*
>
> I think Robert’s intention behind the postings highlighter – and I fully think he is right – is to just give the “end user” (not “technical user”) a quick overview of where the terms match in a document, completely ignoring the type of query. You just want to get a quick context in the document where the terms of your query match. I always explain it to customers like “allow the end user to relevance rank the document themselves”.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> *From:* [email protected] [mailto:[email protected] <[email protected]>]
> *Sent:* Friday, October 10, 2014 4:46 PM
> *To:* [email protected]
> *Subject:* Re: Highlighters, accurate highlighting, and the PostingsHighlighter
>
> On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:
>
> On Fri, Oct 10, 2014 at 12:38 AM, [email protected] <[email protected]> wrote:
> > The fastest highlighter we’ve got in Lucene is the PostingsHighlighter but it throws out any positional nature in the query and can highlight more inaccurately than the other two highlighters. mission from the sponsor commissioning this effort.
>
> That's because it tries to summarize the document contents wrt the query, so the user can decide if it's relevant (versus being a debugger for span queries, or whatever). The algorithms used to do this don't really get benefits from positions, because they are the same ones used for regular IR.
>
> In short, the "inaccuracy" is important, because this highlighter is trying to do something different than the other highlighters.
>
> I’m confused how inaccuracy is a feature, but nevertheless I appreciate that the postings highlighter as-is is good enough for most users. Thanks for your awesome work on this highlighter, by the way!
>
> The reason it might be faster in comparison has less to do with the fact it reads offsets from the postings lists and more to do with the fact it does not have bad O(n^2) etc algorithms that the other highlighters do. It's not faster: it just does not blow up.
>
> Well, it isn’t cheap to re-analyze the document text (what the default highlighter does) nor to read term-vectors and sort the tokens (what the default highlighter does when term vectors are available). At least not with big docs (lots of text to analyze or large term vectors to read and sort). My first steps were to try and make the default highlighter faster but it still isn’t fast enough and it isn’t accurate enough either (for me).
>
> I looked at the FVH a little but thought I’d skip the heft of term vectors and use PostingsHighlighter, now that I’m willing to break open these complex beasts and build what’s needed to meet my accuracy requirements.
>
> Do you foresee any O(n^2) algorithms in what I’ve said?
>
> I don't think you can safely make this highlighter do what you would like without compromising these goals (relevance of passages, and not blowing up): for a phrase or span, how can you compute the within-document freq() without actually reading all those positions (means blowing up)? With terms it's simple, effective, and does not blow up: freq() -> IDF. It's the same term dependence issue from regular scoring, not going to be solved in an email to lucene jira list. The best I can do that is safe is https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed interested, so it sits.
>
> I plan to make simple approximations to score one passage relative to another. The passage with the most diversity in query terms wins, or at least is the highest scoring factor. Then, low within-doc-freq (on a per-term basis). Then, high freq in the passage. Then, shortness of passage and closeness to the beginning. In short, something fast to compute and pretty reasonable; my principal requirement is highlighting accuracy, and it needs to support a lot of query types (incl. custom span queries).
>
> So IMO, for scoring spans or intervals or whatever, a different highlighter is needed that makes some compromises (worse relevance, willingness to blow up). Hopefully they would be contained so that most users aren't impacted heavily and blowing up or getting badly ranked sentences. But I don't think we should make it so PostingsHighlighter can blow up. There are already two other highlighters for that.
>
> Ok; I’m not sure yet how much from the PostingsHighlighter I’ll re-use but there is a lot of it that is pertinent to my aims. So much so, probably, that I can see it being a subclass, or at least belong in the same package. It uses postings/offsets (and not term vectors, and without re-analyzing text).
>
> Thanks for your input, Rob.
>
> ~ David
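
For readers following the thread, the passage-ranking plan David describes above (most diversity in query terms first, then low within-doc-freq per term, then high freq in the passage, then shortness and closeness to the beginning) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Lucene or from anyone in the thread: `Passage`, `rank_key`, `rank_passages`, and `within_doc_freq` are hypothetical names, and summing reciprocal within-document frequencies is one assumed way to express "low within-doc-freq".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Passage:
    """Hypothetical passage: a character span plus per-term hit counts."""
    start: int                 # start offset of the passage in the document
    end: int                   # end offset (exclusive)
    term_hits: Dict[str, int]  # query term -> occurrences inside this passage


def rank_key(p: Passage, within_doc_freq: Dict[str, int]) -> Tuple:
    """Build a sort key; smaller keys rank first.

    The criteria are compared lexicographically, in the order listed
    in the thread. The rarity formula is an assumption for illustration.
    """
    matched = [t for t, c in p.term_hits.items() if c > 0]
    diversity = len(matched)                                  # distinct query terms hit
    rarity = sum(1.0 / within_doc_freq[t] for t in matched)   # rare-in-doc terms score higher
    hits = sum(p.term_hits.values())                          # total hits in the passage
    length = p.end - p.start                                  # shorter is better
    # Negate the "higher is better" factors so that sorting ascending works.
    return (-diversity, -rarity, -hits, length, p.start)


def rank_passages(passages: List[Passage],
                  within_doc_freq: Dict[str, int]) -> List[Passage]:
    """Return passages ordered best first."""
    return sorted(passages, key=lambda p: rank_key(p, within_doc_freq))
```

Each factor here is cheap to compute from per-term counts alone, which is in the spirit of "something fast to compute"; it does not address Robert's point about computing freq() for phrases or spans.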
