[ http://issues.apache.org/jira/browse/LUCENE-644?page=comments#action_12425230 ]

Mark Harwood commented on LUCENE-644:
-------------------------------------

Many thanks for the client code, Ronnie. I have tried it with my index and have 
reproduced the speed-up. 
I'm keen to integrate any code that offers a speed-up, ideally in such a way 
that we have one highlighter and one JUnit test rig that work both with indexes 
that have TermPositionVectors and with those that don't. I suspect this will 
involve merging bits of our code. There are a lot of test cases with strange 
analyzers that need to be considered, which is why I'm keen to have one codebase.

I'm disappearing on two weeks' holiday (vacation) shortly, so I haven't got a 
lot of time to look at this right now, but I plan to when I get back.

After a quick look I haven't yet identified which difference between your 
approach and mine accounts for the speed-up. One likely factor is that your 
code only considers offset positions of tokens that are actually query terms. 
That may be something I could retrofit into TokenSources, so that it produces 
TokenStreams containing only the tokens that matter to the highlighter.
I suspect there are other benefits to be had from your code too, which I'll 
have to consider when I have more time.
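
A rough sketch of that filtering idea, in plain Java with no Lucene classes. 
Token and keepQueryTerms are made-up names for illustration only; this is not 
what TokenSources does today, just one way the retrofit might look, assuming 
each token carries its term text and character offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryTermTokenFilterSketch {

    /** Hypothetical stand-in for a token: term text plus character offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Keeps only the tokens whose term appears in the query, preserving order. */
    static List<Token> keepQueryTerms(List<Token> tokens, Set<String> queryTerms) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (queryTerms.contains(t.term)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("the", 0, 3),
                new Token("quick", 4, 9),
                new Token("brown", 10, 15),
                new Token("fox", 16, 19));
        Set<String> queryTerms = new HashSet<>(Arrays.asList("quick", "fox"));

        // Only the query-term tokens would be handed on to the highlighter.
        for (Token t : keepQueryTerms(tokens, queryTerms)) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```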

Thanks again for this,

Cheers
Mark

> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: http://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java, 
> svn-diff.patch
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene, I've 
> used it a lot! However, when you have *large* documents (fields), 
> highlighting can be quite time-consuming if you increase the number of bytes 
> to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is 
> often too low for indexed PDFs etc., which results in empty highlight 
> strings.
> This is an alternative approach using term position vectors only to build 
> fragment info objects. Then a StringReader can read the relevant fragments 
> and skip() between them. This is a lot faster. Also, this method uses the 
> *entire* field for finding the best fragments so you're always guaranteed to 
> get a highlight snippet.
> Because this method only works with fields which have term positions stored, 
> one can check whether it works for a particular field using the following 
> code (taken from TokenSources.java):
>         TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>         if (tfv != null && tfv instanceof TermPositionVector)
>         {
>           // use FulltextHighlighter
>         }
>         else
>         {
>           // use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.
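
The read-and-skip step described in the issue can be sketched in plain Java. 
This is not the attached FulltextHighlighter code: Fragment and readFragments 
are hypothetical names, and the sketch assumes the fragment offsets have 
already been worked out (for example from a term position vector) and are 
sorted and non-overlapping.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FragmentSkipSketch {

    /** Hypothetical fragment descriptor: character offsets [start, end) in the field text. */
    static final class Fragment {
        final int start;
        final int end;

        Fragment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Reads only the given fragments from the text, skipping the gaps between them. */
    static List<String> readFragments(String text, List<Fragment> fragments) throws IOException {
        List<String> out = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            int pos = 0; // current read position in the text
            for (Fragment f : fragments) {
                // skip() may skip fewer characters than requested, so loop.
                long toSkip = f.start - pos;
                while (toSkip > 0) {
                    long skipped = reader.skip(toSkip);
                    if (skipped <= 0) {
                        throw new IOException("fragment starts beyond end of text");
                    }
                    toSkip -= skipped;
                }
                // Read the fragment itself; stop early if the text ends first.
                char[] buf = new char[f.end - f.start];
                int read = 0;
                while (read < buf.length) {
                    int n = reader.read(buf, read, buf.length - read);
                    if (n < 0) {
                        break;
                    }
                    read += n;
                }
                out.add(new String(buf, 0, read));
                pos = f.start + read;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";
        List<Fragment> fragments = Arrays.asList(
                new Fragment(4, 9), new Fragment(16, 19), new Fragment(35, 39));
        System.out.println(readFragments(text, fragments));
    }
}
```

With a String already in memory, substring() would do the same job; the Reader 
form is shown because the issue describes skipping with a StringReader, and the 
same loop works for any Reader.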


        
