I've just released version 0.0.3 of this plugin. It fixes:
1. An error when returning a no-match fragment with the sentence fragmenter when no_match_size + max_scan is greater than the size of the document.
2. Multi-valued fields using the analyze hit_source were pretty broken: the offsets would be wrong, causing garbled highlights or errors.
3. Fields inside objects would always return no hits for the postings and vectors hit sources.
New tricks:
1. The max_fragments_scored option can be used to limit the number of fragments scored when using score order or the top_scoring option. You can use it to prevent documents with many hits from eating a ton of CPU during highlighting. It is more useful with the sentence fragmenter than with the scan fragmenter, but if your documents are megabytes of text you might want to try it anyway.
2. The fetch_fields option can be used to return other fields alongside the highlighted field. It's a little janky, but it gets the job done if you are careful.

Nik

On Thu, Apr 10, 2014 at 4:04 PM, Nikolas Everett <[email protected]> wrote:
> I've been working on a new highlighter on and off for a few weeks and I'd
> love for other folks to try it out:
> https://github.com/wikimedia/search-highlighter
>
> You should try it because:
> 1. It's pretty quick.
> 2. It supports many of the features of the other highlighters and lets
> you combine them in new ways.
> 3. It has a few tricks that none of the other highlighters have.
> 4. It doesn't require that you store any extra data, but it will use
> what it can to speed itself up.
>
> I've installed it on our beta site
> <http://simple.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=chess+players&fulltext=Search>
> so you can see it in action without installing it.
>
> Let me expand on my list above:
> It doesn't require any extra data, which keeps it nice and fast for short
> fields. Once fields get longer [0], reanalyzing them starts to take too
> long, so it is best to store offsets in the postings, just like the
> postings highlighter. It can use term vectors the same way that the fast
> vector highlighter can, but that is slower than postings and takes up
> more space.
>
> It supports three fragmenters: one that mimics the postings highlighter,
> one that mimics the fast vector highlighter, and one that always
> highlights the whole value.
>
> It supports matched_fields, no_match_size, and most everything else in the
> highlight API. It doesn't support require_field_match, though.
>
> It adds a handful of tricks, like returning the top scoring snippets in
> document order and weighing terms that appear early in the document more
> heavily. Nothing difficult, but still cute tricks. It's reasonably easy to
> implement new tricks, so if you have any ideas I'd love to hear them.
>
> I don't think it is really ready for production usage yet, but I'd like to
> get there in a week or two.
>
> Thanks for reading,
>
> Nik
>
> [0]: I haven't done the measurements to figure out how long the field has
> to be before it is faster to use postings than to reanalyze it. I did the
> math a few months ago for how long the field has to be before vectors
> become faster. It was a couple of KB for my analysis chain, but I'm not
> sure any of that holds true for this highlighter. It could be more or less.
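P.S. In case it helps anyone trying out the new options, here's a sketch of a search request body that exercises them. I'm assuming the highlighter registers under the "experimental" type and that both new options go in the highlighter's "options" map; the field name, query, and option values are just placeholders, so check the README if anything has drifted:

```json
{
  "query": { "match": { "text": "chess players" } },
  "highlight": {
    "order": "score",
    "fields": {
      "text": {
        "type": "experimental",
        "fragmenter": "sentence",
        "number_of_fragments": 2,
        "options": {
          "top_scoring": true,
          "max_fragments_scored": 1000,
          "fetch_fields": ["title"]
        }
      }
    }
  }
}
```

With max_fragments_scored set, the highlighter stops scoring after the first 1000 candidate fragments rather than scoring every fragment in a huge document, and fetch_fields tacks the listed fields onto the highlight response next to each snippet.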
