yuanyun.cn created LUCENE-5381:
----------------------------------
Summary: Lucene highlighter doesn't honor hl.fragsize; it appends
all text for last fragment
Key: LUCENE-5381
URL: https://issues.apache.org/jira/browse/LUCENE-5381
Project: Lucene - Core
Issue Type: Bug
Components: modules/highlighter
Affects Versions: 4.6, 4.0
Reporter: yuanyun.cn
Priority: Minor
Fix For: 5.0, 4.7
Attachments: LUCENE-5381.patch
Recently we hit a problem with the highlighter: I set hl.fragsize = 300,
but the highlighted section for one document was more than 2000 characters long.
Looking into the code, in
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream,
String, boolean, int), after the for loop, it appends all of the remaining text
to the last fragment:
    if (
        // if there is text beyond the last token considered..
        (lastEndOffset < text.length())
        &&
        // and that text is not too large...
        (text.length() <= maxDocCharsToAnalyze)
       )
    {
      // append it to the last fragment
      newText.append(encoder.encodeText(text.substring(lastEndOffset)));
    }
    currentFrag.textEndPos = newText.length();
This code is problematic because, in some cases, the last fragment is the most
relevant section and is the one selected to be returned to the client.
I changed the code as shown below; it seems to work for me :)
    // Test what remains of the original text beyond the point where we stopped analyzing
    if (lastEndOffset < text.length())
    {
      if (textFragmenter instanceof SimpleFragmenter)
      {
        SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
        int remain = simpleFragmenter.getFragmentSize()
            - (newText.length() - currentFrag.textStartPos);
        if (remain > 0)
        {
          int endIndex = lastEndOffset + remain;
          if (endIndex > text.length()) {
            endIndex = text.length();
          }
          newText.append(encoder.encodeText(text.substring(lastEndOffset, endIndex)));
        }
      }
      else
      {
        newText.append(encoder.encodeText(text.substring(lastEndOffset)));
      }
    }
    currentFrag.textEndPos = newText.length();
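
The clamping arithmetic in the change above can be tried in isolation. The
following is only a sketch under stated assumptions: the class
TailClampSketch, the method clampedTail, and its parameters are hypothetical
names that are not part of Lucene, and encoding and highlight markup are
ignored (in the real method the current fragment length is measured on the
encoded output, so the numbers may differ).

```java
// Hypothetical stand-alone sketch of the tail-clamping arithmetic; not Lucene code.
public class TailClampSketch {

    /**
     * Returns the part of the text after lastEndOffset that still fits into
     * the last fragment, given the configured fragment size and the number of
     * characters the fragment already holds.
     */
    public static String clampedTail(String text, int lastEndOffset,
                                     int fragmentSize, int currentFragLength) {
        if (lastEndOffset >= text.length()) {
            return ""; // nothing left beyond the last token
        }
        int remain = fragmentSize - currentFragLength;
        if (remain <= 0) {
            return ""; // the fragment is already full
        }
        // Never read past the end of the text.
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        String text = "0123456789abcdefghij";
        // The fragment holds 5 of 8 allowed characters, so 3 more fit.
        System.out.println(clampedTail(text, 10, 8, 5)); // prints "abc"
    }
}
```

With the unpatched behaviour the same call would correspond to appending
text.substring(10), i.e. the whole remaining "abcdefghij", regardless of the
fragment size.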