[ http://issues.apache.org/jira/browse/LUCENE-644?page=all ]

Ronnie Kolehmainen updated LUCENE-644:
--------------------------------------

    Attachment: TokenSources.java
                TokenSources.java.diff

Mark, I played around a bit with the code in TokenSources and added a method 
which takes an optional Query object. It will return a 
TokenSources.BestFragmentsTokenStream (which could later be identified by 
Highlighter if needed) when the query is not null and term positions are 
available. This token stream only holds the highlighted tokens and their 
surroundings.
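
To illustrate "highlighted tokens and their surroundings", here is a minimal
sketch of the filtering idea. It is not the code in the attached
TokenSources.java: the class name TokenWindowSketch, the method keepAroundHits
and the window parameter are all made up for this example, and plain strings
stand in for Lucene tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of TokenSources. */
final class TokenWindowSketch {

    /**
     * Keeps every token that matches a query term, plus up to `window`
     * tokens on either side of each match. Original order is preserved.
     */
    static List<String> keepAroundHits(List<String> tokens,
                                       Set<String> queryTerms,
                                       int window) {
        boolean[] keep = new boolean[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            if (queryTerms.contains(tokens.get(i))) {
                // Mark the hit and its neighbours, clamped to the token range.
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.size() - 1, i + window);
                for (int p = from; p <= to; p++) {
                    keep[p] = true;
                }
            }
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (keep[i]) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}
```

A real implementation would work on term positions and offsets from the term
vector rather than on a token list, but the selection step would look similar.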

These changes shave off about 50% of the time, but the result is still two 
times slower than my first example. Also, the fragments don't look as 
expected: the terms are highlighted, but the surrounding tokens are most often 
missing. I'm not sure why, as I haven't dug deeper into the Highlighter code. 
The tokens returned by TokenSources look fine though, with positions and in 
order...

It would certainly be nice if most changes could be made in TokenSources. This 
would allow Highlighter to remain as flexible as it is today, with Scorers and 
Formatters.

I won't have time to look at it anymore, at least for a while, but I'm posting 
my version of TokenSources and a diff against current trunk here in case you 
want to have a peek at it later.


> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: http://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java, 
> svn-diff.patch, TokenSources.java, TokenSources.java.diff
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene, I've 
> used it a lot! However, when you have *large* documents (fields), 
> highlighting can be quite time consuming if you increase the number of bytes 
> to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is 
> often too low for indexed PDFs etc., which results in empty highlight 
> strings.
> This is an alternative approach using term position vectors only to build 
> fragment info objects. Then a StringReader can read the relevant fragments 
> and skip() between them. This is a lot faster. Also, this method uses the 
> *entire* field for finding the best fragments so you're always guaranteed to 
> get a highlight snippet.
> Because this method only works with fields which have term positions stored, 
> one can check if this method works for a particular field using the 
> following code (taken from TokenSources.java):
>         TermFreqVector tfv = reader.getTermFreqVector(docId, field);
>         // instanceof is false for null, so no separate null check is needed
>         if (tfv instanceof TermPositionVector)
>         {
>           // term positions are stored: use FulltextHighlighter
>         }
>         else
>         {
>           // no term positions: use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.
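
The "read the relevant fragments and skip() between them" step from the
description above could look roughly like the sketch below. It is not taken
from the attached FulltextHighlighter.java: the class FragmentReader, the
method readFragments and the int[][] offset representation are assumptions
made for this example. It uses only java.io and assumes the [start, end)
character ranges are sorted and do not overlap.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of FulltextHighlighter. */
final class FragmentReader {

    /**
     * Reads the given [start, end) character ranges from text, skipping
     * everything in between. Ranges must be sorted and non-overlapping.
     */
    static List<String> readFragments(String text, int[][] offsets) {
        List<String> fragments = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            long pos = 0; // current read position, in characters
            for (int[] range : offsets) {
                int start = range[0];
                int end = Math.min(range[1], text.length());
                // Jump over the text between fragments. skip() may skip
                // fewer characters than requested, so loop until we arrive.
                while (pos < start) {
                    long skipped = reader.skip(start - pos);
                    if (skipped <= 0) {
                        return fragments; // reached end of input
                    }
                    pos += skipped;
                }
                char[] buf = new char[Math.max(0, end - start)];
                int filled = 0;
                while (filled < buf.length) {
                    int n = reader.read(buf, filled, buf.length - filled);
                    if (n < 0) {
                        break; // end of input inside a fragment
                    }
                    filled += n;
                }
                pos += filled;
                fragments.add(new String(buf, 0, filled));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fragments;
    }
}
```

For example, calling readFragments("The quick brown fox", new int[][] {{4, 9},
{16, 19}}) returns the two fragments "quick" and "fox". In the approach
described above, the offsets would come from the term position vector rather
than being passed in by hand.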


        
