Hi all,
I'm having an issue when highlighting fields that have overlapping tokens.
A bug was opened in Jira some years ago,
https://issues.apache.org/jira/browse/LUCENE-627, but I'm a bit confused about
it. In Jira the bug's status is "resolved", yet I still get the exact same
problem with an unmodified Lucene 2.9.3.
To find out what was going on, I looked at
org.apache.lucene.search.highlight.TokenSources, which rebuilds a TokenStream
from term vectors, and I found that the tokens were not sorted by offset as one
would expect.
When sorting tokens, the following comparator is used:
    public int compare(Object o1, Object o2)
    {
        Token t1=(Token) o1;
        Token t2=(Token) o2;
        if(t1.startOffset()>t2.endOffset())
            return 1;
        if(t1.startOffset()<t2.startOffset())
            return -1;
        return 0;
    }
I'm not sure why endOffset is used instead of startOffset in the first test (it
looks like a typo). With non-overlapping tokens this works just fine,
but with overlapping tokens the longest token gets pushed to the end of its
"overlapping zone": (big,3,6), (fish,7,11), ({big fish},3,11) end up
sorted in this exact order, where I would have expected (big,3,6) ({big
fish},3,11) (fish,7,11) or ({big fish},3,11) (big,3,6) (fish,7,11).
Highlighting with the term "{big fish}" builds a fragment by concatenating
"big", "{big fish}", and "fish", giving this phrase: "big<em>big fish</em>
fish".
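To make the ordering problem concrete, here is a small standalone check of the comparator quoted above on two of the example tokens. It is only a sketch: Tok is a hypothetical stand-in for Lucene's Token (text plus offsets), and I used generics instead of the Object casts, so none of these names come from the Lucene source.

```java
import java.util.Comparator;

public class OriginalComparatorCheck {
    // Hypothetical stand-in for Lucene's Token: just text plus offsets.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Same two tests as the comparator quoted above.
    static final Comparator<Tok> ORIGINAL = new Comparator<Tok>() {
        public int compare(Tok t1, Tok t2) {
            if (t1.start > t2.end) return 1;
            if (t1.start < t2.start) return -1;
            return 0;
        }
    };

    public static void main(String[] args) {
        Tok fish = new Tok("fish", 7, 11);
        Tok bigFish = new Tok("{big fish}", 3, 11);
        // A consistent ordering would give these two calls opposite signs.
        System.out.println(ORIGINAL.compare(bigFish, fish));
        System.out.println(ORIGINAL.compare(fish, bigFish));
    }
}
```

This prints -1 and then 0: "{big fish}" sorts before "fish" when compared one way, but the two are reported as equal when compared the other way. So the result of the sort depends on the order in which the tokens arrive.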
I tested a quick fix by changing the comparator above like this:
    public int compare(Object o1, Object o2)
    {
        Token t1 = (Token)o1;
        Token t2 = (Token)o2;
        if (t1.startOffset() > t2.startOffset())
            return 1;
        if (t1.startOffset() < t2.startOffset())
            return -1;
        if (t1.endOffset() < t2.endOffset())
            return -1;
        if (t1.endOffset() > t2.endOffset())
            return 1;
        return 0;
    }
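For anyone who wants to try the ordering without a Lucene index, the same start-then-end comparison can be run on the three example tokens. Again a sketch under assumptions: Tok and the sorted() helper are hypothetical names of mine, not Lucene classes, and the comparator is typed with generics rather than Object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class OffsetSortSketch {
    // Hypothetical stand-in for Lucene's Token: just text plus offsets.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    // Start offset ascending, ties broken by end offset ascending.
    static final Comparator<Tok> BY_START_THEN_END = new Comparator<Tok>() {
        public int compare(Tok t1, Tok t2) {
            if (t1.start > t2.start) return 1;
            if (t1.start < t2.start) return -1;
            if (t1.end < t2.end) return -1;
            if (t1.end > t2.end) return 1;
            return 0;
        }
    };

    // Returns a sorted copy so the caller's array is left untouched.
    static List<Tok> sorted(Tok... tokens) {
        List<Tok> list = new ArrayList<Tok>(Arrays.asList(tokens));
        Collections.sort(list, BY_START_THEN_END);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(sorted(
                new Tok("big", 3, 6),
                new Tok("fish", 7, 11),
                new Tok("{big fish}", 3, 11)));
    }
}
```

This prints [(big,3,6), ({big fish},3,11), (fish,7,11)], which is the first of the two orders I expected.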
Highlighting behavior is now correct as far as I have tested it.
Maybe the original sort order has a purpose I don't understand, but to me
this slight modification seems to fix everything. What should I do? (I'm very
new to this list and this community.)
If someone with a better understanding of the Lucene highlighter could give me
some feedback, I would be grateful.
Thanks for your time.
Pierre