[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537117#comment-15537117
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
----------------------------------------------

After further consideration, it seems best to leave some of the classes common 
between Postings and the Unified highlighters separate.  If we were to use the 
same classes, they'd ideally move to a common sub-package that both could 
share, but this would introduce unneeded change and hurt compatibility for 
any users of those classes.  Keeping them separate also allows for a possible 
improvement to the method highlightFieldsAsObjects which internally creates a 
Map that is promptly thrown away again in the highlight methods.  I briefly 
investigated changing this to return the internal Object[][] array and avoid 
the extra Map allocation, but this creates some awkwardness since the 
Object[][] array sorts the input fields before filling the arrays, which would 
make the API somewhat of a trap for callers.  This undesired behavior is likely 
why the map is being created.  One way to fix this is to generify 
PassageFormatter over its output type, which would allow for a 
PassageFormatter<String> in the case of the DefaultPassageFormatter.  However, 
this is a rather involved change that could ultimately result in the 
UnifiedHighlighter itself having a generic type, and it was not clear that 
muddying the waters with that right now was a good idea.  Keeping these 
classes separate will still allow for an attempt at that in the future.
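
To make the generification idea concrete, here is a rough sketch of what it 
might look like.  None of this is in the patch: every type below 
(SketchPassage, SketchPassageFormatter, SketchStringFormatter, 
SketchHighlighter) is a hypothetical stand-in invented for illustration, not 
an actual Lucene class or signature.  It also shows why results are keyed by 
field name rather than exposed as the internally sorted array.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a passage: just a [start, end) span of the content.
final class SketchPassage {
  final int start;
  final int end;
  SketchPassage(int start, int end) { this.start = start; this.end = end; }
}

// Hypothetical formatter, generic over its output type T.
interface SketchPassageFormatter<T> {
  T format(List<SketchPassage> passages, String content);
}

// Plays the role a PassageFormatter<String> would play for the default formatter.
final class SketchStringFormatter implements SketchPassageFormatter<String> {
  @Override
  public String format(List<SketchPassage> passages, String content) {
    StringBuilder sb = new StringBuilder();
    for (SketchPassage p : passages) {
      if (sb.length() > 0) sb.append("... ");
      sb.append("<b>").append(content, p.start, p.end).append("</b>");
    }
    return sb.toString();
  }
}

// The generic type leaks up into the highlighter, as the comment above anticipates.
final class SketchHighlighter<T> {
  private final SketchPassageFormatter<T> formatter;
  SketchHighlighter(SketchPassageFormatter<T> formatter) { this.formatter = formatter; }

  Map<String, T> highlightFields(String[] fields,
                                 Map<String, String> contentByField,
                                 Map<String, List<SketchPassage>> passagesByField) {
    // Fields are processed in sorted order internally. Handing back a raw
    // array in this order would silently disagree with the caller's order.
    String[] sorted = fields.clone();
    Arrays.sort(sorted);
    Map<String, T> results = new LinkedHashMap<>();
    for (String field : sorted) {
      results.put(field,
          formatter.format(passagesByField.get(field), contentByField.get(field)));
    }
    // Keyed by field name, so the internal ordering never matters to callers,
    // and the values are typed T, so no cast from Object is needed.
    return results;
  }
}
```

With a shape like this, a caller of SketchHighlighter<String> gets a 
Map<String, String> back directly.  The cost is the extra type parameter on 
the highlighter, which is the involved part.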

In the meantime, I've also pushed a commit to reduce the visibility of 
MultiTermHighlighting to package-private.  As it stands, I think this patch 
is ready.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>         Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
