[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

David Smiley (JIRA) Thu, 08 Sep 2016 08:13:51 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474118#comment-15474118
 ]


David Smiley commented on LUCENE-7438:
--------------------------------------

Thanks for chiming in Rob.

For the present, while there are feature gaps (see "Missing features" above), I 
don't think we can suggest that there be only one highlighter.  I admit I see 
that as a potential eventuality that I think is desirable, but it's a moot 
discussion right now.  That being said, the UH, being based on the PH, does 
everything the PH does and more.  It scores/ranks and formats using the same 
code.  The very kernel of the highlighter that produces the Passage[] (now in 
FieldHighlighter.highlightOffsetsEnums) is essentially the same.  Still, I 
don't think we should remove any highlighters at this time.  
Eventually, we can ask ourselves: what is highlighter XYZ giving us over the 
UnifiedHighlighter?  And then we can see if we (and other users) think it's 
worth keeping it.

RE PostingsHighlighter perf trade-offs:

Yeah, I know it's possible to craft an extreme case that would exercise the 
PostingsEnum reuse -- loads of terms in the query and an optimized index.  
Once we have some benchmarking, we can see how much of a hit we took by not 
re-using.  That feature was retained in the UH for many months until just 
recently, when the UH underwent a large refactor to simplify things.  Other 
than this, I don't believe there are any tricks in the PH that we removed in 
the UH.

RE ranking/scoring "needs":

I'm not aware that the UH might have different passage scoring "needs" than the 
PH.  The PH's algorithm seems really nice to me; I didn't put any thought into 
this aspect.  But yeah, maybe there might be improvements for phrase/span 
queries in particular.  By the way, PhraseHelper simply filters out certain 
occurrences of certain terms.  Perhaps the frequency of the span might be used 
in scoring?  But to know that, you must iterate them, and then you lose lazy 
iteration.  Perhaps someone wanting to trade off performance for possibly 
better passage relevance would make this trade-off?  We/BLAW have no plans to 
do that.  If someone comes along with such a requirement, I hope we can 
accommodate that interesting direction.
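
To make the lazy-iteration trade-off above concrete, here is a minimal sketch 
in plain Java.  It is a toy model, not Lucene code: the class, the methods, 
and the notion of a "maxOffset" cutoff are all hypothetical and exist only for 
illustration.  It shows that a lazy consumer can stop pulling occurrences 
early, whereas anything that needs the total span frequency has to visit every 
occurrence first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model, not Lucene code: every name here is hypothetical. */
final class LazyVsEagerSketch {

    /** Wraps a list of start offsets and counts how many were actually pulled. */
    static final class CountingIterator implements Iterator<Integer> {
        private final Iterator<Integer> delegate;
        int consumed = 0;

        CountingIterator(List<Integer> startOffsets) {
            this.delegate = startOffsets.iterator();
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public Integer next() {
            consumed++;
            return delegate.next();
        }
    }

    /** Lazy: pull occurrences only until an (assumed) offset cutoff is reached. */
    static List<Integer> lazyHighlight(CountingIterator occurrences, int maxOffset) {
        List<Integer> used = new ArrayList<>();
        while (occurrences.hasNext()) {
            int start = occurrences.next();
            if (start >= maxOffset) {
                break; // stop early; later occurrences are never visited
            }
            used.add(start);
        }
        return used;
    }

    /** Eager: knowing the frequency requires visiting every occurrence up front. */
    static int eagerFrequency(CountingIterator occurrences) {
        int freq = 0;
        while (occurrences.hasNext()) {
            occurrences.next();
            freq++;
        }
        return freq;
    }
}
```

With five occurrences and a cutoff that falls after the third, the lazy path 
pulls four items (the fourth triggers the stop), while the eager path always 
pulls all five before any scoring can begin.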

RE moving / renaming / visibility

If you have specific suggestions (e.g. w.r.t. MultiTermHighlighting) on how 
they might be renamed and re-shuffled to different packages, then I'd love to 
hear your thoughts on that.  Some things of the UH are expressly public because 
we/BLAW are using those endpoints, but we/BLAW don't use MultiTermHighlighting 
at this time.  But I could imagine some custom wildcard query coming into 
existence, and it would be a PITA if we couldn't help MTH understand some new 
query.  Similarly for WSTE.

BTW there is a visibility test expressly for ensuring certain things are public.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.
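
The separation of offset source from core highlighting described in the issue 
can be illustrated with a small strategy-pattern sketch.  This is plain Java 
written for illustration only; none of these types are the UnifiedHighlighter's 
actual classes, and the "precomputed" source merely stands in for offsets read 
from the index (postings or term vectors).  It assumes offsets arrive sorted 
and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical types, not the UnifiedHighlighter API. */
final class OffsetSourceSketch {

    /** A match as [start, end) character offsets into the text. */
    static final class Offset {
        final int start;
        final int end;

        Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Strategy: where the offsets come from. */
    interface OffsetSource {
        List<Offset> offsets(String text, String term);
    }

    /** Stand-in for re-analysis: scans the text for the term at highlight time. */
    static final class AnalysisSource implements OffsetSource {
        @Override
        public List<Offset> offsets(String text, String term) {
            List<Offset> result = new ArrayList<>();
            if (term.isEmpty()) {
                return result; // avoid looping forever on an empty term
            }
            int from = 0;
            int i;
            while ((i = text.indexOf(term, from)) >= 0) {
                result.add(new Offset(i, i + term.length()));
                from = i + term.length();
            }
            return result;
        }
    }

    /** Stand-in for index-stored offsets: they were computed ahead of time. */
    static final class PrecomputedSource implements OffsetSource {
        private final List<Offset> stored;

        PrecomputedSource(List<Offset> stored) {
            this.stored = stored;
        }

        @Override
        public List<Offset> offsets(String text, String term) {
            return stored;
        }
    }

    /** Core highlighting: identical no matter which source supplied the offsets. */
    static String highlight(String text, String term, OffsetSource source) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Offset o : source.offsets(text, term)) {
            sb.append(text, last, o.start)
              .append("<b>").append(text, o.start, o.end).append("</b>");
            last = o.end;
        }
        return sb.append(text.substring(last)).toString();
    }
}
```

Both sources feed the same highlight method, so swapping one for the other 
changes only where the offsets are obtained, not how the text is formatted.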



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

