[
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474118#comment-15474118
]
David Smiley commented on LUCENE-7438:
--------------------------------------
Thanks for chiming in Rob.
While there are still feature gaps (see "Missing features" above), I
don't think we can suggest that there be only one highlighter. I admit I see
that as a potential eventuality, and a desirable one, but it's a moot
discussion right now. That being said, the UH, being based on the PH, does
everything it does and more. It scores/ranks and formats using the same code.
The very kernel of the highlighter that produces the Passages[] (now in
FieldHighlighter.highlightOffsetsEnums) is essentially the same. Still, I
don't think we should do any removing of highlighters at this time.
Eventually, we can ask ourselves, what is highlighter XYZ giving us over the
UnifiedHighlighter? And then we can see if we (and other users) think it's
worth keeping it.
RE PostingsHighlighter perf trade-offs:
Yeah I know it's possible to craft an extreme case that would exercise the
PostingsEnum reuse -- loads of terms in the query and an optimized index.
Once we have some benchmarks, we can see how much performance was lost by not
re-using them. That feature was retained in the UH for many months until just
recently when it underwent a large refactor to simplify things. Other than
this, I don't believe there are any tricks in the PH that we removed in the UH.
RE ranking/scoring "needs":
I'm not aware that the UH might have different passage scoring "needs" than the
PH. The PH's algorithm seems really nice to me; I didn't put any thought into
this aspect. But yeah maybe there might be improvements for phrase/span
queries in particular. By the way, PhraseHelper simply filters out certain
occurrences of certain terms. Perhaps the frequency of the span might be used
in scoring? But to know that, you must iterate them all, and then you lose lazy
iteration. Perhaps someone willing to trade performance for possibly
better passage relevance would make this trade-off? We/BLAW have no plans to
do that. If someone comes along with such a requirement, I hope we can
accommodate that interesting direction.
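To make that trade-off concrete, here is a minimal sketch in plain Java. It is
not Lucene or UH code: the class SpanScoringSketch, its two methods, and the
sample offsets are all hypothetical, and a plain Iterator stands in for
whatever would enumerate span occurrences. It only illustrates that learning a
frequency means draining the iterator, whereas lazy consumption can stop early.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
class SpanScoringSketch {

    // Lazy style: pull occurrences only until enough have been seen.
    // Returns how many occurrences were actually consumed.
    static int consumeLazily(Iterator<Integer> startOffsets, int maxOccurrences) {
        int consumed = 0;
        while (consumed < maxOccurrences && startOffsets.hasNext()) {
            startOffsets.next();
            consumed++;
        }
        return consumed;
    }

    // Eager style: the frequency is only known after visiting every
    // occurrence, so the iterator is fully drained before scoring can start.
    static int countAll(Iterator<Integer> startOffsets) {
        int freq = 0;
        while (startOffsets.hasNext()) {
            startOffsets.next();
            freq++;
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up start offsets of one span in one document.
        List<Integer> offsets = Arrays.asList(3, 17, 42, 58, 91);
        System.out.println("lazily consumed: " + consumeLazily(offsets.iterator(), 2));
        System.out.println("full frequency:  " + countAll(offsets.iterator()));
    }
}
```

In the lazy case the remaining occurrences are never touched; in the eager case
the cost grows with the number of occurrences, which is the price of having a
frequency available for scoring.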
RE moving / renaming / visibility:
If you have specific suggestions (e.g. w.r.t. MultiTermHighlighting) on how
they might be renamed and re-shuffled to different packages, then I'd love to
hear your thoughts on that. Some things of the UH are expressly public because
we/BLAW are using those endpoints, but we/BLAW don't use MultiTermHighlighting
at this time. But I could imagine some custom wildcard query coming into
existence, and it would be a PITA if we couldn't help MTH understand some new
query. Similar for WSTE.
BTW there is a visibility test expressly for ensuring certain things are public.
> UnifiedHighlighter
> ------------------
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 6.2
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is
> able to highlight using offsets in either postings, term vectors, or from
> analysis (a TokenStream). Lucene’s existing highlighters are mostly
> demarcated along offset source lines, whereas here it is unified -- hence
> this proposed name. In this highlighter, the offset source strategy is
> separated from the core highlighting functionality. The UnifiedHighlighter
> further improves on the PostingsHighlighter’s design by supporting accurate
> phrase highlighting using an approach similar to the standard highlighter’s
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset
> source strategy that utilizes postings and “light” term vectors (i.e. just the
> terms) for highlighting multi-term queries (wildcards) without resorting to
> analysis. Phrase highlighting and wildcard highlighting can both be disabled
> if you’d rather highlight a little faster albeit not as accurately reflecting
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the
> other highlighters and the results were exciting! It’s tempting to share
> those results but it’s definitely due for another benchmark, so we’ll work on
> that. Performance was the main motivator for creating the UnifiedHighlighter,
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy
> requirements) wasn’t fast enough, even with term vectors along with several
> improvements we contributed back, and even after we forked it to highlight in
> multiple threads.