Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361650672
########## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ########## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { - return baseIter.preceding(offset); // no change needed + final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 + fragmentEndFromPreceding = baseIter.following(fragmentStart); + if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); + } + final int centerLength = fragmentEndFromPreceding - fragmentStart; + final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: I have a working version with this new behavior. There is just a single assertion that fails [here](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/test/org/apache/solr/highlight/TestUnifiedSolrHighlighter.java#L267). It's because the `fragsize=60` there means to try and get 60 characters on the right side of the match as the default `fragalign` is zero. To pass the test the query parameters need to be changed to have `hl.fragsize=50&hl.fragalign=0.5`. This is not a problem by itself, but there is no way to make the new behavior completely backwards-compatible like the previous one. It could also be a bit jarring to have `fragsize` with a distinctively different meaning compared to the other highlighters: - For original&fastVector the `fragsize` is the target size of the snippet. - For unified the `fragsize` is the size of the contextual text around the match. Ie.: the snippet's target size is `fragsize+matchsize`. I'm still in favor of this change, there's literally half as many calls to `preceding()` and `following()` of the underlying BI and it's much clearer as well. Should I change the failing test and push this for review? What about the backward-compatibility? I'm not sure how to change the docs or the default parameters. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org