Hi Bruce,

I'm not actually sure it'll work on 0.90.X. I didn't start working on it until 1.1.0.

"It's pretty quick" means lots of things, unfortunately. If you configure it to segment the source like the postings highlighter, it is typically about 10% slower than the postings highlighter. If you configure it to segment more like the FVH (the default), it is generally faster than the postings highlighter.

Which features of the FVH do you need? I didn't implement them all; in particular, I don't have require_field_match support. In recent releases I've grown phrase support, and I'll make another release sometime soon that fixes some bugs there. It might be best to just try it and see if it works for you.

Before I deployed this, highlighting was the largest time consumer on my cluster, and afterwards it pretty much vanished. The FVH can be very slow at some things.

Just turning on the highlighter may not actually be more efficient, because you have term vectors on each of your fields. The highlighter will attempt to use them, but that might not be the best choice everywhere. For short fields it's probably better to reanalyze them than to load the term vectors. I'm not clear on exactly how many characters or words make a field "short" in this way, but I've seen it happen. Also, for the longer fields, you are probably better off switching from term vectors with_positions_offsets to storing the offsets in the postings. This means configuring the field as though you were going to use the postings highlighter. The term vectors might be faster in some cases, but I don't know which.

You can force reanalyzing the fields by setting "hit_source" to "analyze".

Anyway, let me know how it goes,

Nik

On Thu, May 29, 2014 at 3:26 PM, Bruce Ritchie <[email protected]> wrote:
> Hi Nikolas,
>
> I'm likely to test this in the next couple of weeks (I'm still on 0.90.9); however, I have a question on performance. Does "It's pretty quick" mean comparable performance to the postings highlighter, the fast vector highlighter, or just quick enough for your use case?
>
> The reason I'm asking is that highlighting performance is the largest issue I face currently. Our documents have hundreds of very short fields (well over a thousand if you count the sub-fields in a multi-field field), and listing every field/sub-field to highlight causes queries to be 10-20x slower than highlighting just a single field (100ms -> 2100ms, for example). I can't use the _all field because I need to know the actual field that was highlighted, and only the FVH returns the high-quality results we need. I'm actually toying with the idea of doing a two-phase search where the first phase only highlights a few fields that commonly hit, with a second phase that only searches the remaining hits that didn't highlight on the first pass. That approach may work, but I'd rather just have a highlighter that was faster :)
>
> All the best,
>
> Bruce Ritchie
>
> On Thursday, April 10, 2014 4:04:57 PM UTC-4, Nikolas Everett wrote:
>>
>> I've been working on a new highlighter on and off for a few weeks and I'd love for other folks to try it out: https://github.com/wikimedia/search-highlighter
>>
>> You should try it because:
>> 1. It's pretty quick.
>> 2. It supports many of the features of the other highlighters and lets you combine them in new ways.
>> 3. It has a few tricks that none of the other highlighters have.
>> 4. It doesn't require that you store any extra data but will use what it can to speed itself up.
>>
>> I've installed it on our beta site <http://simple.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=chess+players&fulltext=Search> so you can see it in action without installing it.
>>
>> Let me expand on my list above:
>> It doesn't require any extra data and is nice and fast that way for short fields.
>> Once fields get longer [0], reanalyzing them starts to take too long, so it is best to store offsets in the postings, just like the postings highlighter. It can use term vectors the same way that the fast vector highlighter can, but that is slower than postings and takes up more space.
>>
>> It supports three fragmenters: one that mimics the postings highlighter, one that mimics the fast vector highlighter, and one that always highlights the whole value.
>>
>> It supports matched_fields, no_match_size, and most everything else in the highlight API. It doesn't support require_field_match, though.
>>
>> It adds a handful of tricks, like returning the top-scoring snippets in document order and weighing terms that appear early in the document higher. Nothing difficult, but still cute tricks. It's reasonably easy to implement new tricks, so if you have any ideas I'd love to hear them.
>>
>> I don't think it is really ready for production usage yet, but I'd like to get there in a week or two.
>>
>> Thanks for reading,
>>
>> Nik
>>
>> [0]: I haven't done the measurements to figure out how long the field has to be before it is faster to use postings than to reanalyze it. I did the math a few months ago for how long the field has to be before vectors become faster. It was a couple of KB for my analysis chain, but I'm not sure any of that holds true for this highlighter. It could be more or less.
>
> --
> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7b125714-48dd-4bca-a58d-d56acac94d47%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
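
For anyone trying the configuration advice from the reply at the top of this thread, here is a minimal sketch in Python that only builds the JSON bodies and does not talk to a cluster. It assumes Elasticsearch 1.x-style `string` fields, that long fields store offsets in the postings via `index_options: offsets` (the same setting the postings highlighter needs), and that the plugin is selected with `"type": "experimental"` and takes `hit_source` under `options`. Those last two details are recalled from the plugin's README and are not stated in the thread, so check them against the version you install. The field names (`body`, `tag`) and the helper names are hypothetical.

```python
import json


def offsets_mapping(long_fields):
    """Mapping that stores offsets in the postings for the given (long) fields.

    Assumes ES 1.x "string" fields; field names are placeholders.
    """
    return {
        "properties": {
            name: {"type": "string", "index_options": "offsets"}
            for name in long_fields
        }
    }


def highlight_request(query_text, long_fields, short_fields):
    """Search body that highlights long fields from stored offsets and
    forces reanalysis ("hit_source": "analyze") for short fields.

    The "experimental" type name and the "options" nesting are assumptions.
    """
    fields = {}
    for name in long_fields:
        # Let the highlighter pick the best available source (postings offsets).
        fields[name] = {"type": "experimental"}
    for name in short_fields:
        # Short fields: reanalyze instead of loading term vectors.
        fields[name] = {
            "type": "experimental",
            "options": {"hit_source": "analyze"},
        }
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": list(long_fields) + list(short_fields),
            }
        },
        "highlight": {"fields": fields},
    }


if __name__ == "__main__":
    print(json.dumps(offsets_mapping(["body"]), indent=2))
    print(json.dumps(highlight_request("chess players", ["body"], ["tag"]), indent=2))
```

Whether a given field counts as "short" enough for reanalysis to win is something the thread leaves open, so the split between `long_fields` and `short_fields` has to be found by measuring on your own data.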
