[ https://issues.apache.org/jira/browse/LUCENE-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070066#comment-13070066 ]
Mike Sokolov commented on LUCENE-3318:
--------------------------------------

Uploading a patch for this that builds a TokenStream from the position intervals matched by the query and their offsets, and then uses the existing Highlighter to do fragmentation and markup. This approach should make it easy to skip creating fragments for interstitial (non-matching) portions of large documents, but this issue doesn't cover that yet.

The patch provides two methods for mapping positions to offsets: one is based on term vectors; the other uses offsets stored in payloads. You can also still re-analyze the text. The payload version is about twice as fast as the term vector version, which is around 8x faster than reanalysis (comparable to FastVectorHighlighter). The choice of which to use (or whether to re-analyze) is up to the user; there are no auto-fallback behaviors in here :)

Using these schemes makes fragmentation more difficult. The issue is that offsets and positions are not readily available for all tokens, only for those that matched the query. This makes it harder to fragment the document in reasonable places, and to surround the hits with some appropriate text. However, the substantial speedup seems to make it worth the effort.

Some TODOs:

* There's currently no consistency checking: if no offset payloads were stored and the user attempts to use them, they simply get no highlighting. I think there may be a hard fail if absent term vectors are requested, though.
* Fragmentation doesn't necessarily land on a good boundary; we should at least scan for whitespace in a default fragmenter.

Simon: something a bit weird happens when collecting position intervals now; in some cases the same interval can be collected twice.
This happens w/ConjunctionPositionIterator: when PosCollector.collect(doc) calls advanceTo(doc), positions are collected; I then iterate over more positions and collect() them (which I have to do to get other cases to work), and during this latter iteration the same intervals are reported again. I've worked around this easily enough, but I think it would be easier to work with if it didn't happen. Not sure how difficult that is to arrange.

Also: I made all the collect() methods throw IOException so I could report exceptions from processing payloads.

> Sketch out highlighting based on term positions / position iterators
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3318
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3318
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: modules/highlighter
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Mike Sokolov
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-3318.patch
>
>
> Spun off from LUCENE-2878. Since we already have positions on a large number of
> queries in the branch, it is worth looking at highlighting as a real
> consumer of the API. A prototype is already committed.
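
For illustration only: a minimal sketch of one way a collector could ignore the repeated intervals described in the comment above. This is not the code from the attached patch. Interval and DedupingCollector are hypothetical names, and the sketch is plain Java with no dependency on the branch's position iterator classes. It assumes an interval is identified by its begin and end positions within a single document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a position interval; not the branch's actual class. */
final class Interval {
    final int begin;
    final int end;

    Interval(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Interval)) {
            return false;
        }
        Interval other = (Interval) o;
        return begin == other.begin && end == other.end;
    }

    @Override
    public int hashCode() {
        return 31 * begin + end;
    }
}

/** Collects intervals for one document at a time and skips exact repeats. */
final class DedupingCollector {
    private final Set<Interval> seen = new HashSet<>();
    private final List<Interval> collected = new ArrayList<>();
    private int currentDoc = -1;

    /** Returns true if the interval was new for this document and was kept. */
    boolean collect(int doc, Interval interval) {
        if (doc != currentDoc) {
            // Moving to a different document: drop the previous document's state.
            currentDoc = doc;
            seen.clear();
            collected.clear();
        }
        if (!seen.add(interval)) {
            return false; // same interval reported again; ignore it
        }
        collected.add(interval);
        return true;
    }

    /** Intervals kept for the current document, in the order first reported. */
    List<Interval> intervals() {
        return new ArrayList<>(collected);
    }
}
```

A caller would pass each reported interval to collect() and read intervals() once the document is done. Tracking seen intervals in a set costs extra memory per document, so fixing the duplicate reporting at the iterator level, as asked above, would still be the cleaner solution.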