[ http://issues.apache.org/jira/browse/LUCENE-644?page=comments#action_12425230 ]

Mark Harwood commented on LUCENE-644:
-------------------------------------

Many thanks for the client code Ronnie - I have tried it with my index and have reproduced the speed-up.

I'm keen to integrate any code that offers a speed-up, ideally in such a way that we have one highlighter plus JUnit test rig which can work with indexes that have TermPositionVectors and also with those that do not. This, I suspect, will involve merging bits of our code. There are a lot of test cases with strange analyzers that need to be considered, which is why I'm keen to have one codebase.

I'm disappearing on 2 weeks holiday (vacation) shortly, so I haven't got a lot of time to look at this right now, but I plan to when I get back.

After a quick look I haven't yet identified the difference between your approach and mine which offers the speed-up. One likely factor is that your code only considers offset positions of tokens that are actually query terms. That may be something I could retrofit into TokenSources, to produce TokenStreams of only the tokens that are important to the highlighter. I suspect there are other benefits to be had from your code too, though, which I'll have to consider when I have more time.

Thanks again for this,

Cheers
Mark

> Contrib: another highlighter approach
> -------------------------------------
>
>          Key: LUCENE-644
>          URL: http://issues.apache.org/jira/browse/LUCENE-644
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Other
>     Reporter: Ronnie Kolehmainen
>     Priority: Minor
>  Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java,
>               svn-diff.patch
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene, I've
> used it a lot! However, when you have *large* documents (fields),
> highlighting can be quite time consuming if you increase the number of bytes
> to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is
> often too low for indexed PDFs etcetera, which results in empty highlight
> strings.
> This is an alternative approach using term position vectors only to build
> fragment info objects. Then a StringReader can read the relevant fragments
> and skip() between them. This is a lot faster. Also, this method uses the
> *entire* field for finding the best fragments so you're always guaranteed to
> get a highlight snippet.
> Because this method only works with fields which have term positions stored,
> one can check if this method works for a particular field using the
> following code (taken from TokenSources.java):
>
>     TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>     if (tfv != null && tfv instanceof TermPositionVector)
>     {
>         // use FulltextHighlighter
>     }
>     else
>     {
>         // use standard Highlighter
>     }
>
> Someone else might find this useful so I'm posting the code here.
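
For readers without the attachments to hand, the idea in the description (build fragment ranges from query-term offsets, then read only those ranges with a StringReader and skip() over the rest) might be sketched roughly as below. This is plain Java with no Lucene dependency and is not the attached FulltextHighlighter: the class FragmentSketch, its methods, and the fixed `context` padding are all hypothetical, and the hit offsets are assumed to have been collected already (for example from a field's term position vector) and sorted by start offset.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the attached FulltextHighlighter. */
final class FragmentSketch {

    /** A half-open character range [start, end) within the field text. */
    static final class Fragment {
        final int start;
        final int end;

        Fragment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Turns query-term hits into fragments. Each hit is {startOffset, endOffset}
     * and the array is assumed to be sorted by startOffset. Every hit is padded
     * by `context` characters on both sides; overlapping windows are merged.
     */
    static List<Fragment> buildFragments(int[][] hits, int context, int textLength) {
        List<Fragment> fragments = new ArrayList<>();
        int curStart = -1;
        int curEnd = -1;
        for (int[] hit : hits) {
            int start = Math.max(0, hit[0] - context);
            int end = Math.min(textLength, hit[1] + context);
            if (curStart >= 0 && start <= curEnd) {
                curEnd = Math.max(curEnd, end);      // overlaps: extend current fragment
            } else {
                if (curStart >= 0) {
                    fragments.add(new Fragment(curStart, curEnd));
                }
                curStart = start;                    // start a new fragment
                curEnd = end;
            }
        }
        if (curStart >= 0) {
            fragments.add(new Fragment(curStart, curEnd));
        }
        return fragments;
    }

    /** Reads only the fragment ranges, skipping the text between them. */
    static List<String> readFragments(String text, List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            long pos = 0;
            for (Fragment f : fragments) {
                long toSkip = f.start - pos;
                while (toSkip > 0) {                 // skip() may skip fewer chars than asked
                    long skipped = reader.skip(toSkip);
                    if (skipped <= 0) {
                        throw new IOException("unexpected end of text");
                    }
                    toSkip -= skipped;
                }
                char[] buf = new char[f.end - f.start];
                int filled = 0;
                while (filled < buf.length) {        // read() may return fewer chars than asked
                    int n = reader.read(buf, filled, buf.length - filled);
                    if (n < 0) {
                        throw new IOException("unexpected end of text");
                    }
                    filled += n;
                }
                out.add(new String(buf));
                pos = f.end;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling readFragments(text, buildFragments(hits, context, text.length())) returns one string per merged fragment. The work done depends on the number of query-term hits rather than on re-analyzing the whole field, which may be part of where the speed-up comes from; scoring and selecting the "best" fragments is left out of this sketch.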