[ 
https://issues.apache.org/jira/browse/LUCENE-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070066#comment-13070066
 ] 

Mike Sokolov commented on LUCENE-3318:
--------------------------------------

Uploading a patch for this that builds a TokenStream using position intervals 
from the query (matches) and their offsets, and then uses the existing 
Highlighter to do fragmentation and markup.

This approach should make it easy to skip creating fragments for interstitial 
(non-matching) portions of large documents, but this issue doesn't cover that 
yet.
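
As a rough illustration of the idea above (not the code in the patch), the 
sketch below marks up a text using only the character offsets of the matched 
tokens, never looking at any other token. It uses plain Java rather than 
Lucene's TokenStream and Highlighter classes; the names MatchMarkupSketch, 
Span and highlight, and the pre/post tag parameters, are assumptions made for 
the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class MatchMarkupSketch {
    /** Character offsets of one matched token: [start, end). Hypothetical type. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Wraps each matched span in pre/post; text between matches is copied as-is. */
    static String highlight(String text, List<Span> matches, String pre, String post) {
        List<Span> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span m : sorted) {
            if (m.start < cursor) {
                continue; // skip spans that overlap or repeat an earlier match
            }
            out.append(text, cursor, m.start);               // interstitial text
            out.append(pre).append(text, m.start, m.end).append(post);
            cursor = m.end;
        }
        out.append(text, cursor, text.length());             // tail after last match
        return out.toString();
    }
}
```

A real implementation would emit fragments instead of the whole text, which is 
where skipping the interstitial portions would pay off.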

The patch provides two methods for mapping positions to offsets: one is based 
on term vectors; the other uses offsets stored in payloads.  You can also still 
re-analyze the text. The payload version is about twice as fast as the term 
vector version, which in turn is around 8x faster than re-analysis (comparable 
to FastVectorHighlighter).  The choice of which to use (or whether to 
re-analyze) is up to the user; there are no auto-fallback behaviors in here :)
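
To make the payload option concrete, here is a minimal sketch of storing a 
token's start and end offsets in a payload and reading them back. The 8-byte 
layout (two big-endian ints) and the names OffsetPayloadSketch, encode and 
decode are assumptions for illustration; the message does not say how the 
patch encodes offsets.

```java
import java.nio.ByteBuffer;

final class OffsetPayloadSketch {
    /** Packs start and end offsets into 8 bytes (assumed layout, big-endian). */
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Returns {start, end}, or null if the payload is absent or not 8 bytes. */
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            return null; // no offsets were stored for this position
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

At index time the bytes would be attached to each token; at highlight time 
only the payloads at matching positions need to be decoded.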

Using these schemes makes fragmentation more difficult. The issue is that 
offsets and positions are not readily available for all tokens - only for those 
that matched the query.  This makes it harder to fragment the document in 
reasonable places, and to surround the hits with some appropriate text. 
However, the substantial speedup seems to make it worth the effort.

Some TODOs:

There's currently no consistency checking: if no offset payloads were stored 
and the user attempts to use them, they simply get no highlighting. I think 
requesting term vectors that were not stored may fail hard, though.
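
One possible shape for such a check, sketched in plain Java: fail loudly when 
a matching position has no usable offset payload instead of silently 
producing no highlighting. The class and method names, the field-name 
parameter, and the assumed 8-byte payload length are all hypothetical and not 
taken from the patch.

```java
import java.util.List;

final class OffsetPayloadCheckSketch {
    /** Assumed payload size: two 4-byte ints (start and end offset). */
    static final int EXPECTED_LENGTH = 8;

    /** Throws if any payload for a matching position is missing or malformed. */
    static void requireOffsetPayloads(String field, List<byte[]> payloadsForMatches) {
        for (int i = 0; i < payloadsForMatches.size(); i++) {
            byte[] p = payloadsForMatches.get(i);
            if (p == null || p.length != EXPECTED_LENGTH) {
                throw new IllegalStateException(
                    "field '" + field + "': match " + i
                        + " has no offset payload; was the field indexed with offsets?");
            }
        }
    }
}
```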

Fragmentation doesn't necessarily land on a good boundary; we should at least 
scan for whitespace in a default fragmenter.
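
A minimal sketch of that whitespace scan, assuming a fragmenter that has 
already picked a tentative break offset and only wants to nudge it to the 
start of a nearby word. The names and the maxScan limit are made up for the 
example.

```java
final class WhitespaceBoundarySketch {
    /**
     * Returns the index nearest to target (within maxScan characters, in either
     * direction) that directly follows a whitespace character. If none is found,
     * or target is already at the start or end of the text, target is returned.
     */
    static int snapToWhitespace(String text, int target, int maxScan) {
        int t = Math.max(0, Math.min(target, text.length()));
        if (t == 0 || t == text.length()) {
            return t; // text edges are always acceptable boundaries
        }
        for (int d = 0; d <= maxScan; d++) {
            for (int i : new int[] { t - d, t + d }) {
                if (i > 0 && i <= text.length()
                        && Character.isWhitespace(text.charAt(i - 1))) {
                    return i;
                }
            }
        }
        return t; // no whitespace nearby; keep the original break
    }
}
```

Because only matched tokens have known offsets, the scan has to look at the 
stored text itself rather than at token boundaries.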

Simon: something a bit weird happens when collecting position intervals now; in 
some cases the same interval can be collected twice.  This happens with 
ConjunctionPositionIterator: when PosCollector.collect(doc) calls 
advanceTo(doc), positions are collected; I then iterate over the remaining 
positions and collect() them (which I have to do to get other cases to work), 
and during this latter iteration the same intervals are reported again.  I've 
worked around this easily enough, but I think it would be easier to work with 
if it didn't happen.  Not sure how difficult that is to arrange.
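
For illustration only, one way a collector could tolerate repeated reports is 
to remember which (begin, end) pairs it has already seen for the current 
document. This is a guess at the general shape of such a workaround, not the 
one in the patch, and DedupingCollectorSketch and its methods are hypothetical 
names that do not mirror the branch's collector API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupingCollectorSketch {
    private final Set<Long> seen = new HashSet<>();
    private final List<int[]> intervals = new ArrayList<>();

    /** Resets state when moving on to a new document. */
    void startDoc() {
        seen.clear();
        intervals.clear();
    }

    /** Records the interval unless the same (begin, end) was already collected. */
    void collect(int begin, int end) {
        long key = ((long) begin << 32) | (end & 0xFFFFFFFFL);
        if (seen.add(key)) {
            intervals.add(new int[] { begin, end });
        }
    }

    /** Intervals in first-seen order, each as {begin, end}. */
    List<int[]> intervals() {
        return intervals;
    }
}
```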

Also: I made all the collect() methods throw IOException so I could report 
exceptions from processing payloads.


> Sketch out highlighting based on term positions / position iterators
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3318
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3318
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: modules/highlighter
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Mike Sokolov
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-3318.patch
>
>
> Spin-off from LUCENE-2878. Since we already have positions on a large number 
> of queries in the branch, it is worth looking at highlighting as a real 
> consumer of the API. A prototype is already committed.
