[ https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002548#comment-17002548 ]
Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/24/19 12:02 AM:
-----------------------------------------------------------------------

I could look into making this a GitHub PR tomorrow... I'll change the default fragalign to 0.5 as well.

It also works in SENTENCE mode, but the results won't be as accurate in some cases. Let me elaborate.

In any mode, the selected BreakIterator (WORD, SEPARATOR, SENTENCE, etc.) decides where a slice can happen. The first slice always contains the match. The LengthGoalBreakIterator decides on which side of the first slice the selected BI should add more slices. The logic is generic and will work regardless of the underlying BI. Since the snippet is grown until it reaches fragsize, the size of the last slice added determines how big the overshoot is.

Examples in SENTENCE mode:

Example text: _Hello Susan! I cannot believe the weather is unreal again! The sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not panic._

# If the fragsize is smaller than the first slice (a sentence in this case), no expansion will happen in either direction. Note that fragalign is N/A in this case.
{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=10 makes snippet length of 17
The <b>sky</b> is green.
{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 0.5, the slice will be expanded on the left first and then on the right if any space is left.
{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.

q=sky&hl.fragalign=0.5&hl.fragsize=80 makes snippet length of 119
I cannot believe the weather is unreal again! The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.5&hl.fragsize=120 makes snippet length of 132
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic.
{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 0, the slice will be expanded on the right only. (The match is anchored to 0/left/begin.)
{noformat}
q=sky&hl.fragalign=0.0&hl.fragsize=30 makes snippet length of 73
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.0&hl.fragsize=80 makes snippet length of 90
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not panic.
{noformat}
# If the fragsize is bigger than the first slice and the fragalign is 1, the slice will be expanded on the left only. (The match is anchored to 1/right/end.)
{noformat}
q=sky&hl.fragalign=1.0&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.

q=sky&hl.fragalign=1.0&hl.fragsize=70 makes snippet length of 76
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is green.
{noformat}

In the above examples there are big overshoots of the fragsize: 63 instead of 30 (+110%) and 119 instead of 80 (+49%). These would also occur if the fragalign were 0.1, but the alignment would be even less accurate in cases where the left expansion overshoots:
{noformat}
q=sky&hl.fragalign=0.1&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.
{noformat}
This is because the order of expansion is strictly left first. I guess this could be improved if so desired.

In summary, to ensure the accuracy of the fragsize and fragalign parameters, they have to be chosen in proportion to the approximate size of the slices.
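
The left-first expansion order described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the LengthGoalBreakIterator code: the class `SnippetGrowthSketch`, the method `grow`, and the rule "add whole slices on the left until `fragsize * fragalign` characters of left context are covered, then add slices on the right until `fragsize` is reached" are hypothetical and were chosen because they reproduce the snippet lengths quoted in the examples.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the growth order; NOT the real LengthGoalBreakIterator.
final class SnippetGrowthSketch {

    // The example text, pre-split into sentence slices (trailing spaces kept).
    static final List<String> EXAMPLE = Arrays.asList(
            "Hello Susan! ",
            "I cannot believe the weather is unreal again! ",
            "The sky is green. ",
            "I hope Mrs Smith will bring an umbrella for the picnic. ",
            "Let's not panic.");

    /** Grows a snippet around slices[matchIdx]; left side first, then right. */
    static String grow(List<String> slices, int matchIdx, int fragsize, float fragalign) {
        int first = matchIdx;
        int last = matchIdx;

        // Case 1: the first slice already reaches fragsize, so nothing is added.
        if (join(slices, first, last).length() >= fragsize) {
            return join(slices, first, last);
        }

        // Assumed rule: the left side gets fragsize * fragalign characters,
        // counted in whole slices, so the last slice added may overshoot.
        int leftBudget = (int) (fragsize * fragalign);
        int leftAdded = 0;
        while (first > 0 && leftAdded < leftBudget) {
            first--;
            leftAdded += slices.get(first).length();
        }

        // Whatever is still missing is taken from the right side.
        while (last < slices.size() - 1 && join(slices, first, last).length() < fragsize) {
            last++;
        }
        return join(slices, first, last);
    }

    /** Concatenates slices[first..last] and drops surrounding whitespace. */
    static String join(List<String> slices, int first, int last) {
        return String.join("", slices.subList(first, last + 1)).trim();
    }

    public static void main(String[] args) {
        // The match ("sky") is in slice 2 of the example text.
        for (float align : new float[] {0.0f, 0.5f, 1.0f}) {
            String snippet = grow(EXAMPLE, 2, 30, align);
            System.out.println(align + " -> " + snippet.length() + ": " + snippet);
        }
    }
}
```

With fragsize=30 the model adds one whole sentence on the chosen side, which is where the large overshoot comes from: the budget is met only after a full slice has been added.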
Here's how the worst expected overshoot can be calculated:
{noformat}
float WorstOvershootPercent(float fragsize, float avgSliceLength) {
  return ((((fragsize-1)+avgSliceLength) / fragsize)-1)*100;
}

WORD: (words are usually 12-25 characters at most)
WorstOvershootPercent(15, 12) => 73.33%
WorstOvershootPercent(100, 25) => 24.00%
WorstOvershootPercent(300, 25) => 8.00%

SENTENCE: (a sentence can be very long)
WorstOvershootPercent(300, 300) => 99.67%
WorstOvershootPercent(300, 500) => 166.33%
WorstOvershootPercent(2000, 300) => 14.95%
WorstOvershootPercent(2000, 500) => 24.95%
{noformat}
The other highlighters have similar rules for this. The only thing that can easily improve this in some cases is to search for the length closest to the fragsize instead of the minimum length that reaches it. The LengthGoalBreakIterator has a closestTo mode, but it's not usable because it would require yet another parameter. ([view on github|https://github.com/apache/lucene-solr/blob/1be5b689640fe4d1bf0ae3fd19c5fe93b20a77ef/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330]) Using that mode could produce an undershoot that is closer to the desired size than the overshoot.
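
To make the difference between the two selection modes concrete, here is a minimal sketch. It assumes the candidate snippet lengths at each possible break are already known and sorted ascending; the class `BreakChoiceSketch` and its methods are hypothetical helpers for illustration and are not part of the Lucene API. The candidate lengths 17, 73 and 90 are taken from the fragalign=0 examples in the comment.

```java
// Hypothetical helpers comparing "minimum length" and "closest to" selection.
final class BreakChoiceSketch {

    /** Minimum mode: the first candidate that reaches fragsize, else the longest. */
    static int minimumReaching(int[] ascendingLengths, int fragsize) {
        for (int len : ascendingLengths) {
            if (len >= fragsize) {
                return len;
            }
        }
        return ascendingLengths[ascendingLengths.length - 1];
    }

    /** Closest mode: the candidate with the smallest distance to fragsize. */
    static int closestTo(int[] ascendingLengths, int fragsize) {
        int best = ascendingLengths[0];
        for (int len : ascendingLengths) {
            // Strict comparison: on a tie the shorter (earlier) candidate wins.
            if (Math.abs(len - fragsize) < Math.abs(best - fragsize)) {
                best = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] candidates = {17, 73, 90};
        System.out.println("minimum: " + minimumReaching(candidates, 30));
        System.out.println("closest: " + closestTo(candidates, 30));
    }
}
```

For fragsize=30, minimum mode returns 73 (an overshoot of 43 characters) while closest mode returns 17 (an undershoot of 13), which is the kind of case where the undershoot is nearer to the requested size.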

> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>         Attachments: LUCENE-9093.patch
>
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get
> context to the left of the matches returned; only words to the right of each
> match are shown. I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left. Compare with the original
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support
> was added. I included the hl.fragsize param in the unified URL too, but it's
> making no difference unless set to 0.)

--
This message was sent by Atlassian Jira (v8.3.4#803005)