[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-7438:
---------------------------------
    Attachment: LUCENE_7438_UH_benchmark.patch

I developed a benchmark using Lucene's benchmark module; it's attached as a patch. I made some changes to some existing classes there and it's debatable whether those changes are readily committable.

The benchmark is on 200k documents from the wikipedia/enwiki data set. While poking through the data and running some queries through Luke, I developed a few lists of queries: terms, phrases, and wildcards. There are some boolean operators in there, and both the phrase and wildcard query lists have some occasional TermQuery clauses intermixed too. I had planned to add another query list but this takes a while.

Due to the differences in index data, I have two similar .alg files, one for full term vectors and the other for postings. I used the postings one to test analysis as well but it could have been on either. It should be the same document data. With three query lists and two .alg files, I did a total of 6 benchmark executions, tweaking the file.query.maker.file param each time and switching to the other .alg once.

In the table below, the first (search) row is the time it takes to search and retrieve the data to highlight but not to actually do any highlighting. It's a baseline. The other numbers are over and above that time. In other words, I subtracted the baseline from the benchmark output for each highlighter mode so I could measure highlighting time alone.

I tested the standard Highlighter (SH), PostingsHighlighter (PH), FastVectorHighlighter (FVH), and UnifiedHighlighter (UH). The suffix stands for the offset source: analysis (A), term vectors (V), postings (P), and postings with light term vectors (PV) -- a mode unique to the UH. The code I wrote to test these tried, where possible, to configure them similarly.
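To make the arithmetic behind the table concrete, here is a minimal sketch of the baseline subtraction and of how a relative speedup can be read off two rows. The class and method names are hypothetical and are not part of the attached patch. The inputs are the terms-column figures reported in the table in this message; the UH_A total is reconstructed from the baseline plus the reported difference, so treat it as an assumption rather than raw benchmark output.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of LUCENE_7438_UH_benchmark.patch. */
final class HighlightCost {

    /** Highlighting-only time: total benchmark time minus the search-only baseline. */
    static double highlightOnly(double total, double baseline) {
        return total - baseline;
    }

    /** Ratio of two highlighting-only times: how many times faster the candidate is. */
    static double speedup(double reference, double candidate) {
        return reference / candidate;
    }

    public static void main(String[] args) {
        double baseline = 1.08;             // (search) row, terms column
        double uhATotal = baseline + 1.91;  // assumption: total reconstructed from the table
        double uhA = highlightOnly(uhATotal, baseline);
        double shA = 3.92;                  // SH_A row, terms column (already baseline-subtracted)

        System.out.println(String.format(Locale.ROOT, "UH_A highlighting-only: %.2f", uhA));
        System.out.println(String.format(Locale.ROOT, "SH_A / UH_A: %.2fx", speedup(shA, uhA)));
    }
}
```

Given the margin of error on these measurements, ratios computed this way are only rough indications.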
||Impl||terms||phrases||wildcards||
|(search)|1.08|1.22|1.46|
|SH_A|3.92|4.53|9.33|
|UH_A|1.91|1.70|3.93|
|SH_V|1.83|1.59|3.93|
|FVH_V|0.85|1.36|2.40|
|UH_V|0.80|1.00|1.94|
|PH_P|0.91|0.57|4.02|
|UH_P|0.61|0.36|4.03|
|UH_PV|0.52|0.35|1.76|

I grouped the rows by offset mode so you can see things working off the same offset source. Judging from all the runs I did and as I tweaked what was being measured, there seems to be a large margin of error on these numbers, maybe 15%; I'm not sure. Nevertheless the numbers above seem about right after I have run them a bunch of times and tweaked the benchmark.

Conclusions: The UH is as fast as or faster than the others in each offset mode (UH_P and PH_P are effectively tied on wildcards). It is a *lot* faster in Analysis mode than the standard Highlighter is. In some runs I've also seen the FVH beat out the UH. Note that months ago I ascertained that the FVH is not as sensitive to the performance of the underlying BreakIterator as the UH & PH are -- so "cheap" BIs like the char separator one make for a UH that handily beats the FVH, but expensive BIs (like the default one provided by the JDK) make these two more competitive.

One cool observation that surprised me is the phrase query difference between PH & UH. Despite the UH's accuracy mode being enabled (set to true for these benchmarks), it's still faster than PH. I temporarily disabled it and re-ran and found that the UH _got slower_ when it treated phrases like PH does (bag of terms). I believe that is because the filtering of term positions that the UH does, while it intrinsically has some cost, is cheaper than having the main highlighting loop see more term occurrences, which result in more Passages (each of which also needs to invoke the BreakIterator). Accuracy & speed -- Cool!

Of course this benchmark could be improved... and it could be modified to measure highlighting shorter text or longer text. And maybe try that case of an optimized index and lots of terms in the query.
Maybe benchmark queries with SpanMultiTermQuery in them, or ones with phrases & wildcards. And I was going to measure memory allocation, but given this large matrix I changed my mind as I've got other things to get to. I had done so months ago and the results looked great.

> UnifiedHighlighter
> ------------------
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 6.2
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is
> able to highlight using offsets in either postings, term vectors, or from
> analysis (a TokenStream). Lucene’s existing highlighters are mostly
> demarcated along offset source lines, whereas here it is unified -- hence
> this proposed name. In this highlighter, the offset source strategy is
> separated from the core highlighting functionality. The UnifiedHighlighter
> further improves on the PostingsHighlighter’s design by supporting accurate
> phrase highlighting using an approach similar to the standard highlighter’s
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset
> source strategy that utilizes postings and “light” term vectors (i.e. just the
> terms) for highlighting multi-term queries (wildcards) without resorting to
> analysis. Phrase highlighting and wildcard highlighting can both be disabled
> if you’d rather highlight a little faster albeit not as accurately reflecting
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the
> other highlighters and the results were exciting! It’s tempting to share
> those results but it’s definitely due for another benchmark, so we’ll work on
> that.
> Performance was the main motivator for creating the UnifiedHighlighter,
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy
> requirements) wasn’t fast enough, even with term vectors along with several
> improvements we contributed back, and even after we forked it to highlight in
> multiple threads.