[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682609#action_12682609 ]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn trunk. It doesn't use a Scorer, though, but instead the KS analogue to Lucene's "Weight" class. The Weight is fed what is essentially a single-doc index, built from stored term vectors. Weight.highlightSpans() returns an array of "span" objects, each of which has a start offset, a length, and a score. The Highlighter then processes these span objects to create a "heat map" and choose its excerpt points.

The idea is that by delegating responsibility for creating the scoring spans, we make it easier to support arbitrary Query implementations with a single Highlighter class.
{quote}

Awesome!

Do you require term vectors to be stored for highlighting (ie, you cannot re-analyze the text)?

For queries that normally do not use positions at all (a simple AND/OR of terms), how does your highlightSpans() work? For BooleanQuery, is the coord factor used to favor fragment sets that include more unique terms? Are you guaranteed to always present a net set of fragments that "matches" the query (eg the example query above)?

I think the base litmus test for a highlighter is: if one were to take all fragments presented for a document (call this a "fragdoc") and make a new document from it, would that document match the original query?

In fact, I think the perfect highlighter would "logically" work as follows: take a single document and enumerate every single possible fragdoc. Each fragdoc is allowed to have maxNumFragments fragments, where each fragment has a min/max number of characters. The set of fragdocs is of course ridiculously immense. Take this massive collection of fragdocs, build a new temporary index from it, then run your Query against that index.
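(As an aside, the span/heat-map processing described in the quote above could be sketched roughly like this. All class and method names here are hypothetical illustrations, not the actual KinoSearch or Lucene API: each span carries a start offset, a length, and a score; the highlighter accumulates span scores into a per-character heat map, then slides a fixed-size window over it and picks the hottest window as the excerpt.)

{code:java}
// Hypothetical sketch of heat-map excerpt selection; none of these
// names correspond to real KinoSearch or Lucene classes.
public class HeatMapSketch {

    // A scoring span: start offset (in chars), length, and score.
    static final class Span {
        final int start, length;
        final float score;
        Span(int start, int length, float score) {
            this.start = start; this.length = length; this.score = score;
        }
    }

    // Accumulate span scores into a per-character heat map, then slide a
    // fixed-size window to find the excerpt start with the highest total heat.
    static int bestExcerptStart(Span[] spans, int textLength, int windowSize) {
        float[] heat = new float[textLength];
        for (Span s : spans)
            for (int i = s.start; i < Math.min(s.start + s.length, textLength); i++)
                heat[i] += s.score;

        float windowHeat = 0f, bestHeat = -1f;
        int bestStart = 0;
        for (int i = 0; i < textLength; i++) {
            windowHeat += heat[i];
            if (i >= windowSize) windowHeat -= heat[i - windowSize];
            if (windowHeat > bestHeat) {
                bestHeat = windowHeat;
                bestStart = Math.max(0, i - windowSize + 1);
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        // Two high-scoring spans clustered near offsets 40-50 should pull
        // the excerpt window toward them, away from the lone early span.
        Span[] spans = {
            new Span(5, 4, 1.0f), new Span(40, 4, 2.0f), new Span(46, 4, 2.0f)
        };
        System.out.println(bestExcerptStart(spans, 100, 20));
    }
}
{code}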
Many of the fragdocs would not match the Query, so they are eliminated right off (this is the litmus test). Then, of the ones that do, you want the highest scoring fragdocs. Obviously you can't actually implement a highlighter like that, but I think "logically" that is the optimal highlighter that we are trying to emulate with more efficient implementations.

I think having the Query/Weight/Scorer class be the single source for hits, explanation & highlight spans is the right approach. Having a whole separate package trying to reverse-engineer where matches had taken place between Query and Document is hard to get right. EG BooleanScorer2's coord factor would naturally/correctly influence the selection.

I also think building a [reduced, just Postings] IndexReader API on top of TermVectors ought to be a simple way to get great performance here.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support a bi-gram token stream (a general token stream (e.g. WhitespaceTokenizer) is also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and from LUCENE-644. This approach needs highlighted fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, with phrase-unit highlighting of slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlighted fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply the patch, since it lives in an independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boosts to score fragments (currently doesn't take idf into account, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org