Contrib: another highlighter approach
-------------------------------------

                 Key: LUCENE-644
                 URL: http://issues.apache.org/jira/browse/LUCENE-644
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Other
            Reporter: Ronnie Kolehmainen
            Priority: Minor
         Attachments: FulltextHighlighter.java

Mark Harwood's highlighter package is a great contribution to Lucene, and I've 
used it a lot. However, with *large* documents (fields), highlighting can be 
quite time-consuming if you raise the number of bytes to analyze with 
setMaxDocBytesToAnalyze(int). The default value of 50k is often too low for 
indexed PDFs and similar documents, which results in empty highlight strings.

This is an alternative approach that uses only term position vectors to build 
fragment info objects. A StringReader then reads the relevant fragments and 
uses skip() to jump over the text between them, which is a lot faster. This 
method also uses the *entire* field to find the best fragments, so you always 
get a highlight snippet.
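
As a rough illustration of the skip-and-read idea (not the attached 
FulltextHighlighter.java), here is a minimal sketch in plain Java. It assumes 
the character offsets of the matching terms are already known, for example 
from a stored term vector. The class FragmentReaderSketch, its Fragment type, 
and both method names are hypothetical, and the fragment selection is a 
simple fixed-size window rather than any real scoring.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or the attachment.
final class FragmentReaderSketch {

    /** A fragment of the field text as half-open character offsets [start, end). */
    static final class Fragment {
        final int start;
        final int end;

        Fragment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Groups sorted hit offsets into fragments. A hit opens a fragment of at most
     * `size` characters; later hits that start inside that fragment are absorbed.
     */
    static List<Fragment> buildFragments(int[] sortedHitStarts, int size, int textLength) {
        List<Fragment> fragments = new ArrayList<>();
        int currentEnd = -1;
        for (int hit : sortedHitStarts) {
            if (hit >= currentEnd) {
                currentEnd = Math.min(hit + size, textLength);
                fragments.add(new Fragment(hit, currentEnd));
            }
        }
        return fragments;
    }

    /** Reads only the given fragments (sorted, non-overlapping), skipping the text between them. */
    static List<String> readFragments(String text, List<Fragment> fragments) {
        List<String> result = new ArrayList<>();
        try (StringReader reader = new StringReader(text)) {
            int position = 0;
            for (Fragment f : fragments) {
                // skip() may skip fewer characters than requested, so loop until aligned.
                while (position < f.start) {
                    long skipped = reader.skip(f.start - position);
                    if (skipped <= 0) {
                        return result; // end of text reached before the fragment start
                    }
                    position += (int) skipped;
                }
                char[] buffer = new char[f.end - f.start];
                int filled = 0;
                while (filled < buffer.length) {
                    int n = reader.read(buffer, filled, buffer.length - filled);
                    if (n < 0) {
                        break; // end of text inside the fragment
                    }
                    filled += n;
                }
                position += filled;
                result.add(new String(buffer, 0, filled));
            }
        } catch (IOException e) {
            // StringReader only throws if it has been closed; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

In this sketch only the characters inside the chosen fragments are copied; 
everything in between is passed over with skip(), so no analyzer has to run 
over the full field text.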

Because this method only works with fields that have term positions stored, 
you can check whether it applies to a particular field with the following code 
(taken from TokenSources.java):

        TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
        if (tfv != null && tfv instanceof TermPositionVector)
        {
          // use FulltextHighlighter
        }
        else
        {
          // use standard Highlighter
        }

Someone else might find this useful, so I'm posting the code here.


        
