[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-7438:
---------------------------------
    Attachment: LUCENE_7438_UH_benchmark.patch

I developed a benchmark using Lucene's benchmark module; it's attached as a 
patch.  I made some changes to some existing classes there and it's debatable 
if those changes are readily committable.  The benchmark is on 200k documents 
from the wikipedia/enwiki data set.  While poking through the data and running 
some queries through Luke, I developed a few lists of queries: terms, phrases, 
and wildcards.  There are some boolean operators in there, and both phrase and 
wildcard query lists have some occasional TermQuery clauses intermixed too.  I 
had planned to add another query list but this takes a while.  Due to the 
differences in index data, I have two similar .alg files, one for full term 
vectors, and the other for postings. I used the postings one to test analysis 
as well but it could have been on either.  It should be the same document data. 
 Since I have multiple query lists, I did a total of 6 benchmark executions, 
each time tweaking the file.query.maker.file param and switching to the 
other .alg once.  In the table below, the first (search) row is the time it 
takes to search and retrieve the data to highlight but not to actually do any 
highlighting.  It's a baseline.  The other numbers are over and above that 
time.  In other words, I subtracted the baseline from the benchmark output for 
each highlighter mode so I could measure highlighting time alone.

I tested the standard Highlighter (SH), PostingsHighlighter (PH), 
FastVectorHighlighter (FVH), and UnifiedHighlighter (UH).  The suffix stands 
for the analysis mode: analysis (A), term vectors (V), postings (P), and 
postings with light term vectors (PV) -- a mode unique to the UH.  The code I 
wrote to test these, where possible, tried to configure them similarly.  

||Impl    ||terms||phrases||wildcards||
|(search) |1.08   |1.22     |1.46       |
|SH_A     |3.92   |4.53     |9.33       |
|UH_A     |1.91   |1.70     |3.93       |
|SH_V     |1.83   |1.59     |3.93       |
|FVH_V    |0.85   |1.36     |2.40       |
|UH_V     |0.80   |1.00     |1.94       |
|PH_P     |0.91   |0.57     |4.02       |
|UH_P     |0.61   |0.36     |4.03       |
|UH_PV    |0.52   |0.35     |1.76       |

I grouped the rows by offset mode so you can compare highlighters working off 
the same offset source.  Judging from all the runs I did and as I tweaked what 
was being measured, there seems to be a large % err on these numbers, maybe 
15%; I'm not sure.  Nevertheless the numbers above seem about right after I 
have run them a bunch of times and tweaked the benchmark.
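
To put the table in relative terms, here is a small sketch that computes 
ratios between rows.  It is not part of the attached patch; the class and 
method names are hypothetical, the numbers are copied from the table, and the 
"clearly different" check is only an assumption that treats the rough 15% 
error estimate as a relative threshold.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: ratios between highlighting-only times from the table. */
final class HighlightRatios {
  /** Columns are, in order: terms, phrases, wildcards. */
  static final Map<String, double[]> TIMES = new LinkedHashMap<>();
  static {
    TIMES.put("SH_A",  new double[] {3.92, 4.53, 9.33});
    TIMES.put("UH_A",  new double[] {1.91, 1.70, 3.93});
    TIMES.put("SH_V",  new double[] {1.83, 1.59, 3.93});
    TIMES.put("FVH_V", new double[] {0.85, 1.36, 2.40});
    TIMES.put("UH_V",  new double[] {0.80, 1.00, 1.94});
    TIMES.put("PH_P",  new double[] {0.91, 0.57, 4.02});
    TIMES.put("UH_P",  new double[] {0.61, 0.36, 4.03});
    TIMES.put("UH_PV", new double[] {0.52, 0.35, 1.76});
  }

  /** How many times longer `other` takes than `base` for the given column. */
  static double ratio(String other, String base, int col) {
    return TIMES.get(other)[col] / TIMES.get(base)[col];
  }

  /** Assumption: two timings only count as different beyond the relative error. */
  static boolean clearlyDifferent(double a, double b, double relErr) {
    return Math.abs(a - b) / Math.max(a, b) > relErr;
  }

  public static void main(String[] args) {
    String[] lists = {"terms", "phrases", "wildcards"};
    for (int col = 0; col < lists.length; col++) {
      System.out.printf(Locale.ROOT, "SH_A / UH_A (%s): %.2fx%n",
          lists[col], ratio("SH_A", "UH_A", col));
    }
    // PH_P vs UH_P on wildcards: 4.02 vs 4.03, well inside a 15% margin.
    System.out.println("PH_P vs UH_P wildcards clearly different: "
        + clearlyDifferent(4.02, 4.03, 0.15));
  }
}
```

Under that assumed threshold, the analysis-mode gap between SH and UH stands 
out clearly, while the PH_P and UH_P wildcard times are indistinguishable.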

Conclusions:  The UH is faster in each offset mode than the others, except 
for wildcards in postings mode, where PH and UH are effectively tied.  It is a 
*lot* faster in Analysis mode than the standard Highlighter is.  In some runs 
I've also seen the FVH beat out the UH.  Note that months ago I ascertained 
that the FVH is not as sensitive to the performance of an underlying 
BreakIterator as UH & PH are -- so "cheap" BIs like the char separator one 
make for a UH that handily beats FVH but expensive BIs (like the default 
JDK-provided one) make these two more competitive.  
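
As a rough illustration of why the choice of BreakIterator matters, the 
sketch below contrasts the JDK's rule-based sentence BreakIterator with a 
plain single-character split.  It uses only java.text.BreakIterator from the 
JDK; the class and method names are hypothetical, and the separator scan is a 
stand-in for a separator-based BreakIterator (presumably something like 
Lucene's CustomSeparatorBreakIterator), not its actual implementation.  It 
shows the difference in work per boundary and does not measure anything.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical sketch: two ways of finding passage boundaries in a text. */
final class BoundarySketch {
  /** "Expensive" case: the JDK sentence BreakIterator evaluates break rules. */
  static int countJdkSentences(String text) {
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
    bi.setText(text);
    int count = 0;
    for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
      count++;
    }
    return count;
  }

  /** "Cheap" case: split on one separator char with a single linear scan. */
  static int countSeparatedSegments(String text, char separator) {
    int count = 0;
    int from = 0;
    int idx;
    while ((idx = text.indexOf(separator, from)) >= 0) {
      count++;
      from = idx + 1;
    }
    if (from < text.length()) {
      count++; // trailing segment that has no closing separator
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(countJdkSentences("Hello world. How are you? Fine."));
    System.out.println(countSeparatedSegments("line one\nline two\nline three", '\n'));
  }
}
```

A highlighter that calls the BreakIterator once per candidate passage pays the 
rule-evaluation cost each time, which is consistent with the observation that 
UH and PH are more sensitive to it than FVH.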

One cool observation that surprised me is the phrase query difference between 
PH & UH.  Despite the accuracy mode of UH (set to true for these benchmarks), 
it's still faster than PH.  I temporarily disabled it and re-ran and found that 
the UH _got slower_ when it treated phrases like PH does (bag of terms).  I 
believe that is because the position filtering of these terms that the UH 
does, while it intrinsically has some cost, seems to be cheaper than the main 
highlighting loop seeing more occurrences of terms that result in more Passages 
(which also need to invoke the BreakIterator).  Accuracy & speed -- Cool!
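
The effect described above can be seen in a toy form: filtering term 
positions down to real phrase matches leaves fewer occurrences for the 
passage-forming loop to handle.  The sketch below is only an illustration 
over a tokenized string with a two-word phrase of distinct words; the names 
are hypothetical and it is not how the UH actually filters positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: bag-of-terms matching vs. phrase-aware matching. */
final class PhraseFilterSketch {
  /** Bag of terms: every position of either phrase word is an occurrence. */
  static List<Integer> bagOfTerms(String[] tokens, String first, String second) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].equals(first) || tokens[i].equals(second)) {
        hits.add(i);
      }
    }
    return hits;
  }

  /** Phrase-aware: keep positions only where `first` is directly followed by `second`. */
  static List<Integer> phraseOnly(String[] tokens, String first, String second) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i + 1 < tokens.length; i++) {
      if (tokens[i].equals(first) && tokens[i + 1].equals(second)) {
        hits.add(i);
        hits.add(i + 1);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String[] tokens = "new cars in new york are new and york is old".split(" ");
    System.out.println("bag of terms: " + bagOfTerms(tokens, "new", "york"));
    System.out.println("phrase only:  " + phraseOnly(tokens, "new", "york"));
  }
}
```

Each extra occurrence in the bag-of-terms list is a potential extra Passage, 
and so a potential extra BreakIterator call, which is the cost the filtering 
step avoids.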

Of course this benchmark could be improved... and it could be modified to 
measure highlighting shorter text or longer text.  And maybe try that case of 
an optimized index and lots of terms in the query.  Maybe benchmark queries 
with SpanMultiTermQueryWrapper in them, or ones with phrases & wildcards.  And I was 
going to measure memory allocation but against this large matrix I changed my 
mind as I've got other things to get to.  I had done so months ago and the 
results looked great.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>         Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



