[jira] [Comment Edited] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

Jira Mon, 23 Dec 2019 16:03:37 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002548#comment-17002548
 ]


Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/24/19 12:02 AM:
-----------------------------------------------------------------------

I could look into making this a github PR tomorrow... I'll change the default 
fragalign to 0.5 as well.

It also works in SENTENCE mode, but the results won't be as accurate in some 
cases. Let me elaborate.

In any mode, the selected BreakIterator (WORD, SEPARATOR, SENTENCE, etc.)
decides where a slice can happen. The first slice always contains the match.
The LengthGoalBreakIterator then decides on which side of the first slice the
selected BI should add more slices. The logic is generic and works regardless
of the underlying BI. Since the snippet is grown until it reaches fragsize, the
size of the last slice added determines how big the overshoot is. Examples in
SENTENCE mode:

Example text: _Hello Susan! I cannot believe the weather is unreal again! The 
sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not 
panic._
 # If the fragsize is smaller than the first slice (sentence in this case), no 
expansion will happen in either direction. Note that fragalign is N/A in this 
case.

{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=10 makes snippet length of 17
The <b>sky</b> is green.{noformat}

 # If the fragsize is bigger than the first slice and the fragalign is 0.5, the 
slice will be expanded on the left first and then on the right if any space is 
left.

{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.

q=sky&hl.fragalign=0.5&hl.fragsize=80 makes snippet length of 119
I cannot believe the weather is unreal again! The <b>sky</b> is green. I hope 
Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.5&hl.fragsize=120 makes snippet length of 132
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is 
green. I hope Mrs Smith will bring an umbrella for the picnic.{noformat}
 # If the fragsize is bigger than the first slice and the fragalign is 0, the 
slice will be expanded on the right only. (the match is anchored to 
0/left/begin)

{noformat}
q=sky&hl.fragalign=0.0&hl.fragsize=30 makes snippet length of 73
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic.

q=sky&hl.fragalign=0.0&hl.fragsize=80 makes snippet length of 90
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the 
picnic. Let's not panic.{noformat}
 # If the fragsize is bigger than the first slice and the fragalign is 1, the 
slice will be expanded on the left only. (the match is anchored to 1/right/end)

{noformat}
q=sky&hl.fragalign=1.0&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.

q=sky&hl.fragalign=1.0&hl.fragsize=70 makes snippet length of 76
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is 
green.{noformat}
The above examples show big overshoots of the fragsize: 63 instead of 30
(+110%) and 119 instead of 80 (+49%). These would also occur if the fragalign
were 0.1, but the alignment would be even less accurate in cases where the
left expansion overshoots:
{noformat}
q=sky&hl.fragalign=0.1&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.{noformat}
This is because the order of expansion is strictly left first. I guess this 
could be improved if so desired.
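
The left-first growth described above can be sketched as a simplified model. This is not the actual LengthGoalBreakIterator code: the class name, the method name, and the way the left budget is computed (fragalign times the space remaining after the first slice) are assumptions made for illustration. Slices are treated as trimmed sentences joined by single spaces, which matches the snippet lengths quoted in the examples.

```java
import java.util.List;

/** Simplified model of left-first snippet growth (hypothetical, for illustration only). */
final class LeftFirstExpansionSketch {

    /** Returns the length of the snippet grown around the slice at matchIdx. */
    static int snippetLength(List<String> slices, int matchIdx, int fragsize, double fragalign) {
        int total = slices.get(matchIdx).length();
        if (total >= fragsize) {
            return total; // first slice already meets the goal; fragalign does not apply
        }
        // Assumed rule: this share of the remaining space belongs to the left side.
        int leftBudget = (int) (fragalign * (fragsize - total));
        int left = matchIdx;
        int leftAdded = 0;
        // Strictly left first: whole slices are added until the left budget is covered.
        while (left > 0 && leftAdded < leftBudget) {
            left--;
            leftAdded += slices.get(left).length() + 1; // +1 for the joining space
        }
        total += leftAdded;
        // Then grow to the right until the snippet reaches fragsize.
        int right = matchIdx;
        while (right < slices.size() - 1 && total < fragsize) {
            right++;
            total += slices.get(right).length() + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
            "Hello Susan!",
            "I cannot believe the weather is unreal again!",
            "The sky is green.",
            "I hope Mrs Smith will bring an umbrella for the picnic.",
            "Let's not panic.");
        int match = 2; // index of the sentence containing "sky"
        System.out.println(snippetLength(sentences, match, 10, 0.5));  // 17
        System.out.println(snippetLength(sentences, match, 30, 0.1));  // 63
        System.out.println(snippetLength(sentences, match, 80, 0.5));  // 119
        System.out.println(snippetLength(sentences, match, 80, 0.0));  // 90
        System.out.println(snippetLength(sentences, match, 70, 1.0));  // 76
    }
}
```

Because whole slices are added, a left budget of a single character (fragalign=0.1, fragsize=30) already pulls in the entire 45-character sentence on the left, which is the inaccuracy noted above.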

In summary, the fragsize & fragalign parameters are only accurate when they
are proportional to the approximate size of the slices, i.e. when fragsize is
large compared to a typical slice. Here's how the worst expected overshoot can
be calculated:
{noformat}
float WorstOvershootPercent(float fragsize, float avgSliceLength) {
    return ((((fragsize-1)+avgSliceLength) / fragsize)-1)*100;
}

WORD: (words are usually 12-25 characters at most)
WorstOvershootPercent(15, 12)    =>  73.33%
WorstOvershootPercent(100, 25)   =>  24.00%
WorstOvershootPercent(300, 25)   =>   8.00%

SENTENCE: (a sentence can be very long)
WorstOvershootPercent(300, 300)  =>  99.67%
WorstOvershootPercent(300, 500)  => 166.33%
WorstOvershootPercent(2000, 300) =>  14.95%
WorstOvershootPercent(2000, 500) =>  24.95%{noformat}
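
The worst case behind this formula is a snippet that is one character short of fragsize at the moment one more slice is appended. Under that reading, ((fragsize - 1 + avgSliceLength) / fragsize - 1) * 100 reduces to (avgSliceLength - 1) / fragsize * 100. A small runnable sketch of the simplified form follows; the class and method names are hypothetical and not part of Lucene or Solr.

```java
import java.util.Locale;

/** Worst-case overshoot estimate (hypothetical helper mirroring the formula above). */
final class OvershootBound {

    /**
     * Assumes the snippet holds fragsize - 1 characters when a slice of
     * avgSliceLength characters is appended, so the excess is avgSliceLength - 1.
     */
    static double worstOvershootPercent(int fragsize, int avgSliceLength) {
        // Multiply before dividing to keep the arithmetic as exact as possible.
        return 100.0 * (avgSliceLength - 1) / fragsize;
    }

    public static void main(String[] args) {
        // WORD-like slices versus SENTENCE-like slices for the same fragsize.
        System.out.printf(Locale.ROOT, "%.2f%%%n", worstOvershootPercent(300, 25));
        System.out.printf(Locale.ROOT, "%.2f%%%n", worstOvershootPercent(300, 500));
    }
}
```

The simplified form makes the dependency explicit: the bound grows linearly with slice length and shrinks as fragsize grows.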
The other highlighters have similar rules for this. The only thing that can
easily improve this in some cases is to search for the length closest to the
fragsize instead of the minimum. The LengthGoalBreakIterator has a
closestTo mode, but it's not usable because it would require yet another
parameter. ([view on 
github|https://github.com/apache/lucene-solr/blob/1be5b689640fe4d1bf0ae3fd19c5fe93b20a77ef/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330])

Using that mode could produce an undershoot that is closer to the desired size
than the overshoot.
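
The difference between the two strategies can be illustrated with plain boundary offsets. This is a sketch under stated assumptions, not the LengthGoalBreakIterator implementation: the class, the method names, and the tie-breaking rule are hypothetical. The offsets are the sentence ends of the example text to the right of the match, counted with trailing whitespace.

```java
/** Minimum-length versus closest-to-length boundary choice (hypothetical sketch). */
final class ClosestBoundarySketch {

    /** Minimum mode: the first boundary at or beyond the target offset. */
    static int minimumEnd(int[] boundaries, int target) {
        for (int b : boundaries) {
            if (b >= target) {
                return b;
            }
        }
        return boundaries[boundaries.length - 1]; // text ends before the target
    }

    /** Closest mode: the boundary nearest to the target; ties go to the later one (assumption). */
    static int closestEnd(int[] boundaries, int target) {
        int best = boundaries[0];
        for (int b : boundaries) {
            if (Math.abs(b - target) <= Math.abs(best - target)) {
                best = b;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int start = 59;                    // offset where "The sky is green." begins
        int[] ends = {77, 133, 149};       // sentence ends after the match, ascending
        int target = start + 30;           // fragsize=30, fragalign=0
        System.out.println(minimumEnd(ends, target) - start); // 74: overshoots by 44
        System.out.println(closestEnd(ends, target) - start); // 18: undershoots by 12
    }
}
```

With these offsets, the closest mode returns only the matching sentence, which misses the goal by 12 characters instead of exceeding it by 44.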



> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>         Attachments: LUCENE-9093.patch
>
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get 
> context to the left of the matches returned; only words to the right of each 
> match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much 
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for 
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original 
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is 
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support 
> was added.  I included the hl.fragsize param in the unified URL too, but it's 
> making no difference unless set to 0.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
