Re: [ANN] Elasticsearch experimental highlighter

Bruce Ritchie Thu, 29 May 2014 12:26:56 -0700

Hi Nikolas,

I'm likely to test this in the next couple of weeks (I'm still on 0.90.9); 
however, I have a question on performance. Does 'It's pretty quick' mean 
performance comparable to the postings highlighter, the fast vector 
highlighter, or just quick enough for your use case?


I'm asking because highlighting performance is the largest issue I face 
currently. Our documents have hundreds of very short fields (well over a 
thousand if you count the sub-fields in a multi-field field), and listing 
every field/sub-field to highlight makes queries 10-20x slower than 
highlighting just a single field (100ms -> 2100ms, for example). I can't use 
the _all field because I need to know which field was actually highlighted, 
and only the fvh highlighter returns the high-quality results we need. I'm 
actually toying with the idea of a two-phase search: the first phase 
highlights only a few fields that commonly hit, and the second phase 
searches only the remaining hits that didn't highlight on the first pass. 
That approach may work, but I'd rather just have a highlighter that was 
faster :)
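
A rough sketch of the two-phase idea, in Python for brevity. It is only an 
illustration under several assumptions: `search` is a hypothetical callable 
that sends a request body to the cluster and returns the parsed response, 
the helper names are made up, and the second phase is restricted with a 
`filtered` query plus an `ids` filter as found in the 0.90/1.x query DSL 
(newer versions would need a `bool` query with a filter clause instead).

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: takes a request body, returns the parsed response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def highlight_spec(fields: List[str]) -> Dict[str, Any]:
    """Build a highlight section that asks for the fvh highlighter per field."""
    return {"fields": {name: {"type": "fvh"} for name in fields}}


def two_phase_highlight(
    search: SearchFn,
    query: Dict[str, Any],
    common_fields: List[str],
    remaining_fields: List[str],
) -> Dict[str, Dict[str, Any]]:
    """Return highlights keyed by document id, using at most two requests."""
    # Phase 1: highlight only the few fields that commonly hit.
    first = search({"query": query, "highlight": highlight_spec(common_fields)})
    highlights = {
        hit["_id"]: hit.get("highlight", {}) for hit in first["hits"]["hits"]
    }

    # Hits that came back without any highlight need the expensive pass.
    missing = [doc_id for doc_id, found in highlights.items() if not found]
    if not missing or not remaining_fields:
        return highlights

    # Phase 2: same query, restricted to the unhighlighted hits, with the
    # remaining fields listed for highlighting.
    restricted = {
        "filtered": {"query": query, "filter": {"ids": {"values": missing}}}
    }
    second = search(
        {
            "query": restricted,
            "size": len(missing),
            "highlight": highlight_spec(remaining_fields),
        }
    )
    for hit in second["hits"]["hits"]:
        if hit.get("highlight"):
            highlights[hit["_id"]] = hit["highlight"]
    return highlights
```

Whether this is a net win depends on how many hits fall through to the 
second phase, which I haven't measured.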


All the best,

Bruce Ritchie



On Thursday, April 10, 2014 4:04:57 PM UTC-4, Nikolas Everett wrote:
>
> I've been working on a new highlighter on and off for a few weeks and I'd 
> love for other folks to try it out: 
> https://github.com/wikimedia/search-highlighter
>
> You should try it because:
> 1.  It's pretty quick.
> 2.  It supports many of the features of the other highlighters and lets 
> you combine them in new ways.
> 3.  It has a few tricks that none of the other highlighters have.
> 4.  It doesn't require that you store any extra data but will use what 
> it can to speed itself up.
>
> I've installed it on our beta site 
> <http://simple.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=chess+players&fulltext=Search>
>  
> so you can see it in action without installing it.  
>
> Let me expand on my list above:
> It doesn't require any extra data and is nice and fast that way for short 
> fields.  Once fields get longer [0] reanalyzing them starts to take too 
> long so it is best to store offsets in the postings just like the postings 
> highlighter.  It can use term vectors the same way that the fast vector 
> highlighter can but that is slower than postings and takes up more space.
>
> It supports three fragmenters: one that mimics the postings highlighter, 
> one that mimics the fast vector highlighter, and one that always highlights 
> the whole value.
>
> It supports matched_fields, no_match_size, and most everything else in the 
> highlight API.  It doesn't support require_field_match though.
>
> It adds a handful of tricks like returning the top scoring snippets in 
> document order and weighting terms that appear early in the document 
> higher.  Nothing difficult, but still cute tricks.  It's reasonably easy to 
> implement new tricks so if you have any ideas I'd love to hear them.
>
> I don't think it is really ready for production usage yet but I'd like to 
> get there in a week or two.
>
> Thanks for reading,
>
> Nik
>
> [0]: I haven't done the measurements to figure out how long the field has 
> to be before it is faster to use postings than to reanalyze it.  I did the 
> math a few months ago for how long the field has to be before vectors 
> become faster.  It was a couple of KB for my analysis chain but I'm not 
> sure any of that holds true for this highlighter.  It could be more or less.
>  
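
For reference, the three offset sources described in the quoted announcement 
(reanalyzing the text, offsets stored in the postings, and term vectors) 
correspond to existing field mapping options. The sketch below builds the 
per-field mapping for each one. The function name and the `offset_source` 
labels are made up for illustration, and the `string` field type is an 
assumption that matches the Elasticsearch versions discussed in this thread.

```python
from typing import Any, Dict


def text_field_mapping(offset_source: str) -> Dict[str, Any]:
    """Return a field mapping for one of three offset sources.

    offset_source is a hypothetical label:
      "reanalyze"    - store nothing extra; the text is reanalyzed on demand
      "postings"     - store offsets in the postings list
      "term_vectors" - store term vectors with positions and offsets
    """
    mapping: Dict[str, Any] = {"type": "string"}
    if offset_source == "postings":
        mapping["index_options"] = "offsets"
    elif offset_source == "term_vectors":
        mapping["term_vector"] = "with_positions_offsets"
    elif offset_source != "reanalyze":
        raise ValueError(f"unknown offset source: {offset_source!r}")
    return mapping
```

With hundreds of very short fields, the plain "reanalyze" mapping is the 
case the announcement describes as fast without any extra stored data.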
