[jira] Commented: (LUCENE-644) Contrib: another highlighter approach

Mark Harwood (JIRA) Wed, 02 Aug 2006 03:54:47 -0700

    [ http://issues.apache.org/jira/browse/LUCENE-644?page=comments#action_12425189 ]

Mark Harwood commented on LUCENE-644:
-------------------------------------


Hi Ronnie,
Thanks for the contribution, but I'm not sure I follow the justification for 
producing this code. Could it be because you assume the existing highlighter 
still requires an Analyzer to obtain a list of tokens? For a while now the 
highlighter has accepted a TokenStream (as an alternative to an Analyzer) in 
order to allow for faster sources of tokenized data. 
The TokenSources class from which you took some of the code was introduced 
specifically to create TokenStreams quickly from TermPositionVectors. However, 
I notice that the JUnit tests don't contain an example of using it; that 
should be added.

Is there something else in your contribution/thinking here that I have missed?



Cheers,
Mark

> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: http://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java, 
> svn-diff.patch
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene; I've 
> used it a lot! However, when you have *large* documents (fields), 
> highlighting can be quite time-consuming if you increase the number of bytes 
> to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is 
> often too low for indexed PDFs etc., which results in empty highlight 
> strings.
> This is an alternative approach using term position vectors only to build 
> fragment info objects. Then a StringReader can read the relevant fragments 
> and skip() between them. This is a lot faster. Also, this method uses the 
> *entire* field for finding the best fragments so you're always guaranteed to 
> get a highlight snippet.
> Because this method only works with fields that store term vectors with 
> positions, one can check whether it works for a particular field using the 
> following code (taken from TokenSources.java):
>         TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>         if (tfv != null && tfv instanceof TermPositionVector)
>         {
>           // use FulltextHighlighter
>         }
>         else
>         {
>           // use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.
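
For readers following the thread, the "build fragment info, then skip() 
between fragments" idea described above can be sketched roughly as below. 
This is only an illustration using plain java.io, not the attached 
FulltextHighlighter and not Lucene API: the FragmentSkipSketch class, the 
Fragment type and the hard-coded hit offsets are all hypothetical, and the 
offsets stand in for what a term position vector with offsets would supply.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Lucene or the attachment.
class FragmentSkipSketch {

    /** A character range [start, end) of the stored field text. */
    static final class Fragment {
        final int start;
        int end;
        Fragment(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Builds one fragment per hit, padded by `context` chars on each side,
     * merging fragments that overlap. `hits` holds {startOffset, endOffset}
     * pairs sorted by start offset (assumed to come from a term vector).
     */
    static List<Fragment> buildFragments(int[][] hits, int context, int textLength) {
        List<Fragment> fragments = new ArrayList<>();
        for (int[] hit : hits) {
            int start = Math.max(0, hit[0] - context);
            int end = Math.min(textLength, hit[1] + context);
            Fragment last = fragments.isEmpty() ? null : fragments.get(fragments.size() - 1);
            if (last != null && start <= last.end) {
                last.end = Math.max(last.end, end); // overlaps previous: extend it
            } else {
                fragments.add(new Fragment(start, end));
            }
        }
        return fragments;
    }

    /** Reads only the fragment ranges, calling skip() over the text in between. */
    static List<String> readFragments(Reader reader, List<Fragment> fragments) throws IOException {
        List<String> out = new ArrayList<>();
        long pos = 0; // current character position in the reader
        for (Fragment f : fragments) {
            long toSkip = f.start - pos;
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip);
                if (skipped <= 0) return out; // text ended earlier than expected
                toSkip -= skipped;
            }
            char[] buf = new char[f.end - f.start];
            int read = 0;
            while (read < buf.length) {
                int n = reader.read(buf, read, buf.length - read);
                if (n < 0) break;
                read += n;
            }
            out.add(new String(buf, 0, read));
            pos = f.start + read;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "aaaa bbbb cccc dddd eeee ffff gggg hhhh";
        int[][] hits = { {5, 9}, {30, 34} }; // made-up offsets of "bbbb" and "gggg"
        List<Fragment> fragments = buildFragments(hits, 2, text.length());
        for (String s : readFragments(new StringReader(text), fragments)) {
            System.out.println(s);
        }
    }
}
```

The point of the sketch is only that the text between fragments is never 
tokenized or copied; scoring the fragments and marking up the hits are left 
out.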


