[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

David Smiley (JIRA) Wed, 21 Sep 2016 06:28:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509898#comment-15509898
 ]


David Smiley commented on LUCENE-7438:
--------------------------------------

BTW just to reinforce how wide the variance in these numbers is, note that 
UH_P and UH_PV should theoretically highlight term queries with identical code 
(PV gets internally optimized to P when there are no wildcards), yet the 
results differed: 0.61 vs. 0.52 (17%).
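
For reference, the 17% figure can be reproduced as the relative difference 
measured against the smaller number. This is only a sketch of the arithmetic; 
the class and method names are made up for illustration, and the units are 
whatever the benchmark reports.

```java
import java.util.Locale;

public class RelativeDifference {
    // Relative difference of `value` with respect to `baseline`, in percent.
    public static double percentDifference(double value, double baseline) {
        return (value - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double higher = 0.61; // the two figures quoted in the comment
        double lower = 0.52;
        // (0.61 - 0.52) / 0.52 * 100, printed with one decimal place
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                percentDifference(higher, lower))); // prints 17.3%
    }
}
```

Measured against the larger number instead, the same gap works out to about 
14.8%, so the baseline matters when comparing runs.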

Quoting myself earlier:
bq. For the surface classes users use: Passage, PassageScorer, 
PassageFormatter, DefaultPassageFormatter. – I don't think it's good to have 
users use parts of another highlighter (postingshighlight); that would be 
weird for them. I propose copying these with a leading 'U', i.e. UPassage etc. 
That said, if others think that's a worse trade-off, it's no big deal to me. 
Once o.a.l.s.ph.Passage's constructor is public, it's possible to do that.

Reconsidering... I think it's fine to use these classes from the 
PostingsHighlighter. After all, users only actually see these classes when 
they want to customize them.  A user can simply call public methods on UH and 
reference zero other classes (getting back strings based on the default 
implementations of those classes).  If there are no objections to this path, 
Tim and I can update the PR along with the other changes listed in the same 
comment I just referenced.

BTW this highlighter has been in production for about 6 months.  That use 
rooted out a couple of bugs.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>         Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



