[jira] [Updated] (LUCENE-5381) Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment

yuanyun.cn (JIRA) Wed, 01 Jan 2014 07:25:44 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


yuanyun.cn updated LUCENE-5381:
-------------------------------

    Description: 
We recently hit a problem with the highlighter: I set hl.fragsize = 300, but
the highlighted section for one document contains more than 2000 characters.

Looking into the code, after the for loop in
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream,
String, boolean, int), all of the remaining text is appended to the last
fragment:
if (
                // if there is text beyond the last token considered..
                (lastEndOffset < text.length())
                &&
                // and that text is not too large...
                (text.length()<= maxDocCharsToAnalyze)
        )
{
        //append it to the last fragment
        newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();

This code is problematic: in some cases the last fragment is the most relevant
section and is the one selected to be returned to the client, so the oversized
fragment is what the user sees.

I changed the code as shown below, and it seems to work for me :)
// Test what remains of the original text beyond the point where we stopped
// analyzing
if (lastEndOffset < text.length())
{
        if (textFragmenter instanceof SimpleFragmenter)
        {
                SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
                int remain = simpleFragmenter.getFragmentSize()
                                - (newText.length() - currentFrag.textStartPos);
                if (remain > 0)
                {
                        int endIndex = lastEndOffset + remain;
                        if (endIndex > text.length()) {
                                endIndex = text.length();
                        }
                        newText.append(encoder.encodeText(
                                        text.substring(lastEndOffset, endIndex)));
                }
        }
        else
        {
                newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }
}
currentFrag.textEndPos = newText.length();
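
The clamping arithmetic in the change above can be tried in isolation. The
following is only a minimal sketch with no Lucene dependency: the class
TailClampSketch, the method clampedTail and its parameter names are
hypothetical and do not exist in Lucene. It assumes the fragment size and the
number of characters already in the current fragment are passed in as plain
ints. In the real change the used length is measured on newText, which may
also contain encoder output and highlight markup, so the budget there is
approximate.

```java
// Hypothetical sketch, not Lucene code: reproduces only the "remain" and
// "endIndex" arithmetic from the proposed change.
final class TailClampSketch {

    /**
     * Returns the text after lastEndOffset that still fits into the current
     * fragment, or an empty string if nothing is left or the fragment is full.
     */
    static String clampedTail(String text, int lastEndOffset,
                              int fragmentSize, int currentFragmentLength) {
        if (lastEndOffset >= text.length()) {
            return ""; // no text beyond the last token considered
        }
        int remain = fragmentSize - currentFragmentLength;
        if (remain <= 0) {
            return ""; // fragment already at or over its size budget
        }
        // Never read past the end of the original text.
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        // Build a 2000-character document.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append('x');
        }
        String text = sb.toString();

        // Unbounded append, as in the existing code: 1500 characters.
        System.out.println(text.substring(500).length());

        // Clamped append with fragment size 300 and 120 characters already
        // in the current fragment: 180 characters.
        System.out.println(clampedTail(text, 500, 300, 120).length());
    }
}
```

With a fragment size of 300, the tail appended to the last fragment is capped
at whatever is left of the budget instead of growing with the document.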

> Lucene highlighter doesn't honor hl.fragsize; it appends all text for last 
> fragment
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-5381
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5381
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0, 4.6
>            Reporter: yuanyun.cn
>            Priority: Minor
>              Labels: highlighter, lucene
>             Fix For: 5.0, 4.7
>
>         Attachments: LUCENE-5381.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h


