[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

Robert Muir (JIRA) Thu, 08 Sep 2016 06:58:48 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473933#comment-15473933
 ]


Robert Muir commented on LUCENE-7438:
-------------------------------------

I think there is room for plenty of new highlighters in lucene, so it's great 
to have another one. I do get the feeling from some issues etc. that some feel 
there "can be only one", but I don't see any good reasons for that. On the 
contrary, like codecs and other things in lucene, we should explore different 
approaches that give the user more choices (like this new highlighter here). 

I think this is especially important because of how "personal" highlighting is 
to the app, and the fact that performance and relevance are tricky here, 
depending on how the app works! For example, about the reuse note: this 
highlighter discards reuse of some internal lucene structures, but under some 
circumstances (e.g. certain query structures/Directory impl/doc sizes/top-N 
sizes/stopwords or lack thereof) this could indeed matter a lot. PH does the 
reuse simply because it tries to maximize perf everywhere (possibly to the 
extreme: perhaps it really is the wrong tradeoff, but that was a "different" 
direction to explore). There are lots of ways these things can perform well or 
be very slow, and a lot of it is hard to generalize across all use-cases!

As far as the duplication of classes, I'd be a little careful before 
refactoring too much of it, because of that very reason. Maybe UH needs to 
ultimately go in different directions than PH and we should just let it do that.

For example, ranking: PH disregards query structure and tries to use a 
bag-of-words approach with something similar to traditional ranking for that. 
The idea is that hopefully that stuff works well on a small scale too.
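
To make the bag-of-words idea concrete, here is a rough sketch in plain Java. 
It is not PH's actual scoring code: the class name, the IDF-like weight, and 
the 1.2 saturation constant are all assumptions picked for illustration. The 
only point is that each passage is scored from the query terms it contains, 
with no knowledge of phrases, boolean clauses, or term order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: rank passages as a bag of query terms, ignoring query structure. */
class BagOfWordsPassageSketch {

    /** IDF-like weight (assumed formula): terms with a lower docFreq weigh more. */
    static double weight(int docFreq, int numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Saturating term frequency (assumed constant 1.2): repeated hits add less each time. */
    static double tfNorm(int freq) {
        return freq / (freq + 1.2);
    }

    /** Scores one passage; docFreqs maps each lowercased query term to its document frequency. */
    static double score(String passage, Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : passage.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (docFreqs.containsKey(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        double total = 0.0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            total += weight(docFreqs.get(e.getKey()), numDocs) * tfNorm(e.getValue());
        }
        return total;
    }

    /** Returns the highest-scoring passage, or null if the list is empty. */
    static String best(List<String> passages, Map<String, Integer> docFreqs, int numDocs) {
        String bestPassage = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String p : passages) {
            double s = score(p, docFreqs, numDocs);
            if (s > bestScore) {
                bestScore = s;
                bestPassage = p;
            }
        }
        return bestPassage;
    }
}
```

Note that "lucene highlighter" and "highlighter lucene" get the same score 
here, which is exactly the information a structure-aware scorer would want to 
keep.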

But UH might need something else if it attempts to use more query structure 
than bag-of-words. I haven't looked to see how things like IDF are computed 
there; that's just an example.

And maybe the right direction for PH, given what it tries to do, is to do 
something like LUCENE-4909... that sorta sits out there because we don't have a 
good way of measuring quality? 

I also do worry a bit about making internal-only classes like 
MultitermHighlighting public to the user; I think this has a really heavy API 
cost. Maybe for that one in particular it's the right way to go, given how 
hacky/hairy it is especially. Maybe it can be renamed to something better to 
limit the confusion :) This problem isn't really unique to highlighters though, 
it's something that should be addressed better in general, e.g. with 
internal-only packages that are hidden or something like that.

Anyway, these are just some general thoughts. Glad to see we will have more 
choices.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.
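
The description's point about separating the offset source strategy from the 
core highlighting can be pictured with a small stand-alone sketch. This is not 
the UnifiedHighlighter API: every name below (OffsetSourceSketch, OffsetSource, 
AnalysisSource, StoredSource) is hypothetical, a regex stands in for a 
TokenStream, and a prebuilt map stands in for offsets kept in postings or term 
vectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: where offsets come from is kept apart from how text gets marked up. */
class OffsetSourceSketch {

    /** Strategy: returns [start, end) character offsets of matching terms, sorted by start. */
    interface OffsetSource {
        List<int[]> offsets(String text, Set<String> terms);
    }

    /** Re-tokenizes the text at highlight time (stand-in for analysis / a TokenStream). */
    static class AnalysisSource implements OffsetSource {
        private static final Pattern WORD = Pattern.compile("\\w+");

        public List<int[]> offsets(String text, Set<String> terms) {
            List<int[]> out = new ArrayList<>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                if (terms.contains(m.group().toLowerCase(Locale.ROOT))) {
                    out.add(new int[] {m.start(), m.end()});
                }
            }
            return out;
        }
    }

    /** Looks up offsets recorded ahead of time (stand-in for postings or term vectors). */
    static class StoredSource implements OffsetSource {
        private final Map<String, List<int[]>> stored;

        StoredSource(Map<String, List<int[]>> stored) {
            this.stored = stored;
        }

        public List<int[]> offsets(String text, Set<String> terms) {
            List<int[]> out = new ArrayList<>();
            for (String term : terms) {
                out.addAll(stored.getOrDefault(term, Collections.emptyList()));
            }
            out.sort(Comparator.comparingInt(o -> o[0]));
            return out;
        }
    }

    /** Core highlighting: the same code runs whichever source produced the offsets. */
    static String highlight(String text, Set<String> terms, OffsetSource source) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] o : source.offsets(text, terms)) {
            sb.append(text, last, o[0]).append("<b>").append(text, o[0], o[1]).append("</b>");
            last = o[1];
        }
        return sb.append(text.substring(last)).toString();
    }
}
```

Given the same terms and consistent stored offsets, both sources yield 
identical markup; only the cost of obtaining the offsets differs, which is 
where the performance tradeoffs discussed in this issue come in.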



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

