I always assumed this was the default behaviour of the Lucene TermHighlighter, but I may be confusing it with an older version. I did find out that there are major differences between Lucene and Solr, though, and I have similar problems with the latter.

Best regards,
Evert Wagenaar
http://www.evertwagenaar.com/

On Sat, May 27, 2017 at 12:08, Dawid Weiss <dawid.we...@gmail.com> wrote:

> Thanks for your explanation, David.
>
> I actually found working with all Lucene highlighters pretty
> difficult. I have a few requirements which seemed deceptively simple:
>
> 1) highlight query hit regions (phrase, fuzzy, terms);
> 2) try to organise the resulting snippets to visually "center" the hit
>    regions so that the context of the hit is visible;
> 3) keep the snippet limited to ~x characters (this means breaking on
>    word boundaries, typically, but keeping the overall length of the
>    snippet close to x);
> 4) add visual cues whether the snippet is part of a larger text
>    (ellipsis). This should be done intelligently -- if a snippet is
>    actually the whole field or starts/ends on the field boundary, no
>    ellipsis should be added;
> 5) for performance reasons we typically have a single copy-to field
>    that is used as the default field for the query parser. But for the
>    user interface needs we'd have to go back and try to highlight the
>    original fields that formed this content. This is probably the most
>    difficult and I didn't expect it to be solved with existing
>    highlighters, but it'd be a great thing to have eventually.
>
> Some of the above are possible with existing highlighters, some are
> not. Having a limited snippet length and keeping word boundary breaks
> turned out to be most confusing to me with the unified highlighter,
> for example. I can't use the sentence break iterator because the text
> in question occasionally has super-long word sequences that result in
> snippets that are enormous.
>
> I'll keep thinking.
>
> Dawid
>
> On Fri, May 26, 2017 at 3:57 PM, David Smiley <david.w.smi...@gmail.com> wrote:
> > I was recently asked if/how the UnifiedHighlighter can return a Passage
> > centered around the highlighted words. I'm responding to a wider
> > audience (java-user list, ...).
> >
> > Each highlighter implementation fragments the content into passages
> > (with highlights) using a different algorithm.
> >
> > The UnifiedHighlighter (and the now defunct PostingsHighlighter from
> > which it derives) fragments the content to create passages entirely
> > based on a java.text.BreakIterator. BreakIterator only sees/knows about
> > the content (it's initialized with it via setText(string)); it doesn't
> > know where highlighted words are. This is why the default UH
> > BreakIterator impl is a sentence-based one and most people probably
> > will let it be. Given how the UH actually uses the BreakIterator, you
> > can create a custom one that is only designed to work with this
> > highlighter and that makes some assumptions about how it's used,
> > resulting in some fragmentation that isn't so rigidly based on the
> > content. The LengthGoalBreakIterator is such a BreakIterator. But it
> > can only "see" the first highlighted word of a passage and make
> > fragmentation decisions based on that alone.
> >
> > The other two highlighters (the original Highlighter and I think the
> > FastVectorHighlighter) are more flexible in this regard; they have
> > their own abstraction that allows for Passages to be formed sensitive
> > to where exactly the highlighted words are. Thus, with the original
> > Highlighter, you could fairly easily achieve a goal of, say, 10 words
> > before the first highlighted word, highlighting more words within 10
> > words of each other until the next is too far away, then 10 more
> > trailing words. I suspect FastVectorHighlighter can do this but its
> > API confuses me. The FastVectorHighlighter also uses a BreakIterator
> > in BreakIteratorBoundaryScanner but its use is entirely different from
> > how the UnifiedHighlighter uses one.
> >
> > Perhaps the UnifiedHighlighter should be enhanced to make more flexible
> > fragmentation algorithms possible.
> > Today you'd need to override FieldHighlighter.highlightOffsetsEnums,
> > which is a lot to ask of anyone; even doing that is annoying, and then
> > re-implementing that method is onerous since it's so complex -- it's
> > really the heart of the UH. The UH could add an entirely new
> > abstraction apart from BreakIterators (with a BI-based impl
> > available), or perhaps an optional marker interface for UH-aware
> > BreakIterators. The former (a new abstraction) would be cleaner, and
> > might also remove a wart in the API due to the statefulness of
> > BreakIterators. It's also kinda hard to write a BI correctly. I've
> > implemented a few already and I know. It's an old API.
> >
> > ~ David
> >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
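
For requirements 2) to 4) in Dawid's list (centering the hit, keeping the
snippet near x characters while breaking on word boundaries, and adding an
ellipsis only where text was actually cut), a minimal sketch using nothing
but the JDK's java.text.BreakIterator might look like the code below. It is
not a Lucene highlighter and calls no Lucene API. The class SnippetSketch,
the method centeredSnippet and its parameters are hypothetical names, and
the hit offsets are assumed to be known already.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: cuts one snippet around one hit.
final class SnippetSketch {

    /**
     * Returns a snippet of roughly maxLen characters around the hit
     * [hitStart, hitEnd), cut on word boundaries. The ellipsis markers
     * are added on top of maxLen.
     */
    static String centeredSnippet(String text, int hitStart, int hitEnd, int maxLen) {
        if (text.length() <= maxLen) {
            return text; // the whole field fits, so no ellipsis
        }

        // Split the remaining budget around the hit. If one side runs into
        // a field boundary, the other side gets the unused part.
        int budget = Math.max(0, maxLen - (hitEnd - hitStart));
        int before = Math.min(hitStart, budget / 2);
        int after = Math.min(text.length() - hitEnd, budget - before);
        before = Math.min(hitStart, budget - after);
        int rawStart = hitStart - before;
        int rawEnd = hitEnd + after;

        // The BreakIterator only knows the text, not where the hit is.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        // Snap both ends inwards to word boundaries so that no partial
        // word is shown, but never cut into the hit itself.
        int start = words.isBoundary(rawStart) ? rawStart : words.following(rawStart);
        int end = words.isBoundary(rawEnd) ? rawEnd : words.preceding(rawEnd);
        start = Math.min(start, hitStart);
        end = Math.max(end, hitEnd);

        // Ellipsis only on a side where text was actually left out.
        String body = text.substring(start, end).trim();
        return (start > 0 ? "..." : "") + body + (end < text.length() ? "..." : "");
    }
}
```

With the text "The quick brown fox jumps over the lazy dog near the river
bank", the hit "jumps" at offsets 20 to 25 and maxLen 21, this returns
"...fox jumps over...". A hit on the leading "The" (offsets 0 to 3) returns
"The quick brown fox..." with no leading ellipsis, because the snippet
starts on the field boundary. Super-long word sequences are not a problem
here since the length budget, not a sentence break, decides where to cut.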