[ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998660#comment-16998660
 ] 

Nándor Mátravölgyi commented on LUCENE-9093:
--------------------------------------------

I'm back with a patch! [^LUCENE-9093.patch]

This adds a `hl.fragalign` parameter to the Unified Highlighter. I've documented 
how it works and updated the related tests. The new parameter is 
backward-compatible: its default value preserves the current behavior. From the 
new docs:
{noformat}
Fragment alignment influences where the match is positioned within a passage. 
This floating point value determines how the remaining `hl.fragsize` of the 
passage is split around the match. The default value of `0.0` aligns the match 
to the left; this is the backward-compatible setting. A value of `0.5` places 
an equal amount of text on both sides of the match, while `1.0` aligns it to 
the right. Note: there are situations where the requested alignment is not 
achievable. This depends on the length of the match, the break iterator in 
use, and the text content around the match.

Before the introduction of this parameter, all passages had left-aligned 
matches. Setting `hl.bs.type` to `WORD` and `hl.fragalign` to `0.5` produces 
results that closely resemble what the other highlighters produce by default.
{noformat}
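To illustrate the arithmetic the docs describe, here is a minimal sketch of how 
an alignment value could split the remaining `hl.fragsize` around a match. The 
class and method names (`FragAlignSketch`, `leftBudget`, `rightBudget`) are 
hypothetical and are not taken from the patch; the real implementation also 
snaps to BreakIterator boundaries, so actual passages will not match these 
character budgets exactly.

```java
/**
 * Hypothetical illustration only; this is NOT the code from LUCENE-9093.patch.
 * It shows one way a fragment-alignment value could divide the part of
 * hl.fragsize that is left over once the match itself is accounted for.
 */
final class FragAlignSketch {

    /** Characters of context budgeted to the left of the match. */
    static int leftBudget(int fragsize, int matchLength, double fragalign) {
        // Assumption: alignment values outside [0.0, 1.0] are clamped.
        double align = Math.max(0.0, Math.min(1.0, fragalign));
        // A match longer than fragsize leaves nothing to distribute.
        int remaining = Math.max(0, fragsize - matchLength);
        // Truncate toward zero so left + right never exceeds the remainder.
        return (int) (remaining * align);
    }

    /** Characters of context budgeted to the right of the match. */
    static int rightBudget(int fragsize, int matchLength, double fragalign) {
        int remaining = Math.max(0, fragsize - matchLength);
        return remaining - leftBudget(fragsize, matchLength, fragalign);
    }

    public static void main(String[] args) {
        int fragsize = 30;                   // hl.fragsize from the issue's example
        int matchLength = "Apple".length();  // the highlighted term
        for (double align : new double[] {0.0, 0.5, 1.0}) {
            System.out.printf("fragalign=%.1f left=%d right=%d%n",
                    align,
                    leftBudget(fragsize, matchLength, align),
                    rightBudget(fragsize, matchLength, align));
        }
    }
}
```

With `0.0` the whole budget goes to the right of the match, which corresponds 
to the left-anchored snippet reported in the issue. Adding `hl.fragalign=0.5` 
to the unified request from the issue description (alongside 
`hl.bs.type=WORD`) would be the way to ask for context on both sides.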
I must say that I've changed my mind about the abstraction. A proper one 
instead of the chained BreakIterators would be much nicer. The 
LengthGoalBreakIterator already behaved differently from a generic 
BreakIterator in a few ways, and this change makes it work even less like one. 
It should be totally fine in its specifically crafted universe. However, a 
better abstraction/structure would be required if we want style points as 
well. The difficulty is that the chaining of the BreakIterators would need a 
refactor, which has a far greater scope than this issue.

> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>         Attachments: LUCENE-9093.patch
>
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get 
> context to the left of the matches returned; only words to the right of each 
> match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much 
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for 
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original 
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is 
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support 
> was added.  I included the hl.fragsize param in the unified URL too, but it's 
> making no difference unless set to 0.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
