[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7438:
---------------------------------
    Attachment: LUCENE-7438.patch

(I'm attaching the patch)
All new files; no changes to anything existing.

I plan to commit Tuesday to give even more time for review.

I'd also like to commit the patch for the benchmark module (but without the 
query files polluting the file listing?).  However, for that to be okay, I 
think it needs to go further and remove the way highlighters were benchmarked 
before this; keeping both would be too hacky/weird, particularly since the 
existing mechanism has hooks into ReadTask (getBenchmarkHighlighter()).  I 
figure the entire benchmark module can change at will without back-compat 
concerns.

While looking at the FVH and WEH I noticed a feature in which term vectors 
from multiple fields can be used to highlight one field -- useful when you 
analyze the text in different ways into different fields (e.g. stemming vs 
not).  We're actually doing that with the UH at Bloomberg (offset source 
agnostic of course) but I didn't think to add it as a first-class feature of 
the UH.  Now I think we should, in a follow-up issue.  That requirement is 
what makes us want things like StrictPhraseHelper to be public; once it's a 
first-class feature, I think StrictPhraseHelper could be made package-private.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>         Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.
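
To picture the design point in the description above (the offset source 
strategy kept separate from the core highlighting), here is a minimal, 
self-contained sketch. It is not the code in the patch: OffsetSource, 
AnalysisOffsetSource, PrecomputedOffsetSource and highlight() are hypothetical 
names invented for illustration, and the <b> tags are an arbitrary choice. The 
sketch also assumes offsets arrive sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OffsetStrategySketch {

    /** A matched span in the original text: [start, end). */
    static final class Offset {
        final int start;
        final int end;
        Offset(int start, int end) { this.start = start; this.end = end; }
    }

    /** Hypothetical abstraction: where the match offsets come from. */
    interface OffsetSource {
        List<Offset> offsets(String text, String term);
    }

    /** Stand-in for the "analysis" source: re-scan the text at highlight time. */
    static final class AnalysisOffsetSource implements OffsetSource {
        @Override
        public List<Offset> offsets(String text, String term) {
            List<Offset> out = new ArrayList<>();
            String haystack = text.toLowerCase(Locale.ROOT);
            String needle = term.toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                return out;
            }
            int from = 0;
            while ((from = haystack.indexOf(needle, from)) >= 0) {
                out.add(new Offset(from, from + needle.length()));
                from += needle.length();
            }
            return out;
        }
    }

    /** Stand-in for an index-backed source: offsets were recorded ahead of time. */
    static final class PrecomputedOffsetSource implements OffsetSource {
        private final List<Offset> stored;
        PrecomputedOffsetSource(List<Offset> stored) { this.stored = stored; }
        @Override
        public List<Offset> offsets(String text, String term) {
            return stored; // nothing is re-analyzed here
        }
    }

    /** Core highlighting: it does not know or care where the offsets came from. */
    static String highlight(String text, String term, OffsetSource source) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Offset o : source.offsets(text, term)) {
            sb.append(text, last, o.start)
              .append("<b>").append(text, o.start, o.end).append("</b>");
            last = o.end;
        }
        return sb.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        String text = "The quick fox";
        OffsetSource analysis = new AnalysisOffsetSource();
        OffsetSource stored = new PrecomputedOffsetSource(Arrays.asList(new Offset(4, 9)));
        // Both lines print: The <b>quick</b> fox
        System.out.println(highlight(text, "quick", analysis));
        System.out.println(highlight(text, "quick", stored));
    }
}
```

Swapping the source changes only how the offsets are obtained; highlight() 
itself is untouched, which is the property the description attributes to the 
unified design.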



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
