Spot on, Walter.  “Trust in the engine” is precisely how it was put to me
for the technical user community that I’m aiming this at.  When they see
“wrong” highlighting, they lose trust.  One might argue it’s a matter of
educating the user but I think it’s a reasonable requirement (a reasonable
thing to want of Lucene).

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Fri, Oct 10, 2014 at 12:55 PM, Walter Underwood <[email protected]>
wrote:

> I think of snippets and highlighting as explaining to the end user why the
> engine decided this was relevant. This tends to increase the user’s trust
> in the engine even when the results are not relevant.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/
>
>
> On Oct 10, 2014, at 9:37 AM, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
> > I’m confused how inaccuracy is a feature, but nevertheless I appreciate
> > that the postings highlighter as-is is good enough for most users.  Thanks
> > for your awesome work on this highlighter, by the way!
>
> The problem here is that there are two different opinions about what
> highlighting should do. What most “technical” people want is **not**
> “highlighting” in the sense of “showing where the search terms match in a
> specific document so that users can ‘relevance test’ a specific result
> themselves”; instead, technical people want “query debugging”: showing
> exactly why a query matches. But this is not what highlighting was made for
>  *(especially not the postings highlighter!).*
>
> I think Robert’s intention behind the postings highlighter (and I fully
> think he is right) is just to give the “end user” (not the “technical
> user”) a quick overview of where the terms match in a document, completely
> ignoring the type of query. You just want to get a quick context in the
> document where the terms of your query match. I always explain it to
> customers like “allow the end user to relevance rank the document
> themselves”.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> *From:* [email protected] [mailto:[email protected]
> <[email protected]>]
> *Sent:* Friday, October 10, 2014 4:46 PM
> *To:* [email protected]
> *Subject:* Re: Highlighters, accurate highlighting, and the
> PostingsHighlighter
>
> On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:
>
> On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
> <[email protected]> wrote:
> > The fastest highlighter we’ve got in Lucene is the PostingsHighlighter
> > but it throws out any positional nature in the query and can highlight
> > more inaccurately than the other two highlighters. mission from the
> > sponsor commissioning this effort.
> >
>
> That's because it tries to summarize the document contents wrt the
> query, so the user can decide if it's relevant (versus being a debugger
> for span queries, or whatever). The algorithms used to do this don't
> really get benefits from positions, because they are the same ones
> used for regular IR.
>
>
> In short, the "inaccuracy" is important, because this highlighter is
> trying to do something different than the other highlighters.
>
>
> I’m confused how inaccuracy is a feature, but nevertheless I appreciate
> that the postings highlighter as-is is good enough for most users.  Thanks
> for your awesome work on this highlighter, by the way!
>
>
> The reason it might be faster in comparison has less to do with the
> fact it reads offsets from the postings lists and more to do with the
> fact it does not have bad O(n^2) etc algorithms that the other
> highlighters do. It's not faster: it just does not blow up.
>
>
> Well, it isn’t cheap to re-analyze the document text (what the default
> highlighter does) nor to read term-vectors and sort the tokens (what the
> default highlighter does when term vectors are available).  At least not
> with big docs (lots of text to analyze or large term vectors to read and
> sort).  My first steps were to try and make the default highlighter faster
> but it still isn’t fast enough and it isn’t accurate enough either (for me).
>
> I looked at the FVH a little but thought I’d skip the heft of term vectors
> and use PostingsHighlighter, now that I’m willing to break open these
> complex beasts and build what’s needed to meet my accuracy requirements.
>
> Do you foresee any O(n^2) algorithms in what I’ve said?
>
>
> I don't think you can safely make this highlighter do what you would
> like without compromising these goals (relevance of passages, and not
> blowing up): for a phrase or span, how can you compute the
> within-document freq() without actually reading all those positions
> (means blowing up)? With terms it's simple, effective, and does not
> blow up: freq() -> IDF. It's the same term dependence issue from
> regular scoring, not going to be solved in an email to lucene jira
> list. The best I can do that is safe is
> https://issues.apache.org/jira/browse/LUCENE-4909, and nobody seemed
> interested, so it sits.
>
>
> I plan to make simple approximations to score one passage relative to
> another.  The passage with the most diversity in query terms wins, or at
> least is the highest scoring factor. Then, low within-doc-freq (on a
> per-term basis).  Then, high freq in the passage.  Then, shortness of
> passage and closeness to the beginning.  In short, something fast to
> compute and pretty reasonable; my principal requirement is highlighting
> accuracy, and it needs to support a lot of query types (incl. custom span
> queries).
>
>
> So IMO, for scoring spans or intervals or whatever, a different
> highlighter is needed that makes some compromises (worse relevance,
> willingness to blow up). Hopefully they would be contained so that
> most users aren't impacted heavily by blowing up or getting badly
> ranked sentences. But I don't think we should make it so
> PostingsHighlighter can blow up. There are already two other
> highlighters for that.
>
>
> Ok; I’m not sure yet how much of the PostingsHighlighter I’ll re-use but
> there is a lot of it that is pertinent to my aims.  So much so, probably,
> that I can see it being a subclass, or at least belonging in the same
> package.  It uses postings/offsets (not term vectors, and without
> re-analyzing text).
>
> Thanks for your input, Rob.
>
> ~ David
>
>
>
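
For readers following the thread, the passage-ranking heuristic David outlines above (diversity of query terms first, then low within-document frequency, then high frequency in the passage, then shortness and closeness to the beginning) could be sketched roughly as follows. This is only an illustration under stated assumptions: `PassageRanking`, `Candidate`, `rarity`, and `rank` are hypothetical names that exist neither in Lucene nor in the thread, and the rarity measure (sum of 1/freq over the matched terms) is just one guess at what "low within-doc-freq on a per-term basis" might mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: none of these types are part of Lucene.
final class PassageRanking {

    /** Hypothetical summary of one candidate passage. */
    public static final class Candidate {
        public final int startOffset;        // character offset where the passage begins
        public final int endOffset;          // character offset where the passage ends (exclusive)
        public final int distinctQueryTerms; // number of different query terms found in the passage
        public final double rarity;          // assumed measure: higher = matched terms are rarer in the doc
        public final int matchCount;         // total query-term occurrences inside the passage

        public Candidate(int startOffset, int endOffset, int distinctQueryTerms,
                         double rarity, int matchCount) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.distinctQueryTerms = distinctQueryTerms;
            this.rarity = rarity;
            this.matchCount = matchCount;
        }

        public int length() {
            return endOffset - startOffset;
        }
    }

    /** Assumed rarity measure: sum of 1/freq over the within-document frequencies of the matched terms. */
    public static double rarity(int... withinDocFreqs) {
        double sum = 0.0;
        for (int freq : withinDocFreqs) {
            sum += 1.0 / Math.max(1, freq);
        }
        return sum;
    }

    /** Orders candidates best-first, applying the criteria in the order listed in the email. */
    public static final Comparator<Candidate> BEST_FIRST =
        Comparator.comparingInt((Candidate c) -> c.distinctQueryTerms).reversed()
            .thenComparing(Comparator.comparingDouble((Candidate c) -> c.rarity).reversed())
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.matchCount).reversed())
            .thenComparingInt(c -> c.length())      // shorter passages first
            .thenComparingInt(c -> c.startOffset);  // earlier passages first

    /** Returns a new list sorted best-first; the input list is left untouched. */
    public static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(BEST_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(Arrays.asList(
            new Candidate(100, 180, 2, rarity(2, 2), 2),
            new Candidate(0, 50, 1, rarity(1), 5),
            new Candidate(300, 380, 2, rarity(2, 2), 3)));
        for (Candidate c : ranked) {
            System.out.println(c.startOffset + "-" + c.endOffset
                + " distinctTerms=" + c.distinctQueryTerms + " matches=" + c.matchCount);
        }
    }
}
```

A strict lexicographic comparison like this only needs per-term counts, so it stays cheap to compute. It does not address Robert's concern about scoring phrases or spans, where the within-document frequency cannot be known without reading the positions.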
