While the overall framework of LUCENE-663 appears similar to the current
contrib Highlighter, the code is actually quite different and I do not
think it handles as many corner cases in its current state. LUCENE-663
supports PhraseQueries by implementing 'special' search logic that
inspects positional information to make sure the Tokens from a
PhraseQuery are in order. I am not sure how exact this logic is compared
to Lucene's PhraseQuery search logic, but a cursory look makes me think
it's not complete. It looks to me like it only does in-order matching
with simple slop (not edit distance). I am too lazy to check further,
though, and I may have missed something. Also, LUCENE-663 does not
support Span queries.
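
To make the "in-order with simple slop" distinction concrete, here is a minimal sketch of what such a check could look like. It is not the LUCENE-663 code; the class and method names are hypothetical, and it assumes each phrase term comes with a sorted array of its token positions. Terms must appear in phrase order, and the total number of extra positions between the first and last term must not exceed the slop.

```java
import java.util.List;

public class InOrderSlopSketch {

    /**
     * Hypothetical helper: positions.get(i) holds the sorted token positions
     * of the i-th phrase term. Returns true if one position per term can be
     * chosen in strictly increasing order with at most `slop` extra positions
     * between the first and last term.
     */
    public static boolean matchesInOrder(List<int[]> positions, int slop) {
        if (positions.isEmpty()) {
            return false;
        }
        for (int start : positions.get(0)) {
            if (extend(positions, 1, start, start, slop)) {
                return true;
            }
        }
        return false;
    }

    private static boolean extend(List<int[]> positions, int term,
                                  int first, int prev, int slop) {
        if (term == positions.size()) {
            return true; // every term has been placed
        }
        for (int p : positions.get(term)) {
            if (p <= prev) {
                continue; // in-order only: must follow the previous term
            }
            int gaps = (p - first) - term; // extra positions used so far
            if (gaps > slop) {
                break; // positions are sorted, so later ones only get worse
            }
            if (extend(positions, term + 1, first, p, slop)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this never matches terms that appear in reverse order, no matter how large the slop. Lucene's sloppy PhraseQuery can match them, since its slop counts position moves (two adjacent transposed terms need a slop of at least 2).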
This patch differs in that it fits the current Highlighter framework
without modifying it, and it uses Lucene's own internal search logic to
identify Spans for highlighting. PhraseQueries are handled by a
SpanQuery approximation.
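
As a rough illustration of span-aware scoring, the sketch below gives a token a non-zero score only if it is a query term and its position falls inside a span that actually matched. This is not the patch itself: the class and method names are hypothetical, token positions are assumed to equal array indexes, and the spans are passed in by hand as half-open [start, end) position ranges rather than obtained from Lucene's search logic.

```java
import java.util.Set;

public class SpanAwareScoreSketch {

    /**
     * Hypothetical scorer: returns one score per token. A token scores 1 only
     * when its text is a query term AND its position lies inside one of the
     * matching spans; otherwise it scores 0.
     */
    public static float[] scoreTokens(String[] tokens, Set<String> queryTerms,
                                      int[][] spans) {
        float[] scores = new float[tokens.length];
        for (int pos = 0; pos < tokens.length; pos++) {
            if (queryTerms.contains(tokens[pos]) && insideAnySpan(pos, spans)) {
                scores[pos] = 1f;
            }
        }
        return scores;
    }

    private static boolean insideAnySpan(int pos, int[][] spans) {
        for (int[] span : spans) {
            if (pos >= span[0] && pos < span[1]) { // half-open range
                return true;
            }
        }
        return false;
    }
}
```

For the text "new york is not york new" and the phrase "new york", the only matching span is [0, 2), so only the first two tokens are scored. A scorer that looks at terms alone would also mark the trailing "york new".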
As far as PhraseQuery/SpanQuery highlighting goes, I don't think any of the
other Highlighter packages offer much. I think that things could be done
a little faster, but that would require abandoning the current
framework, and with all of the corner cases it now supports, I'd hate to
see that.
The other Highlighter code that is worth consideration is LUCENE-644. It
does abandon the current Highlighter framework and takes an approach
that is much more efficient for larger documents: instead of spinning
through all of the document tokens and comparing them to query tokens,
644 just looks at the tokens from the query and grabs the original text
using the offsets from those tokens. This is darn fast, but it doesn't
fit well with positional highlighting, and I wonder how well it supports
all of the corner cases that arise with overlapping tokens and whatnot.
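
The offset-driven idea can be sketched as follows. This is not the LUCENE-644 code; the class name, method name, and the way offsets are supplied are assumptions for illustration. Given the character offsets of the matching query terms as half-open [start, end) pairs, the original text is sliced directly without walking every document token. Overlapping offsets are merged here so that tags are never nested, which is one simple way to handle the overlapping-token case.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OffsetHighlightSketch {

    /**
     * Hypothetical helper: wraps each [start, end) character range of `text`
     * in the given pre/post tags. Ranges may arrive unsorted and may overlap.
     */
    public static String highlight(String text, int[][] offsets,
                                   String pre, String post) {
        int[][] sorted = offsets.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] o) -> o[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // end of the text copied so far
        int i = 0;
        while (i < sorted.length) {
            int start = sorted[i][0];
            int end = sorted[i][1];
            // Merge any following ranges that overlap the current one.
            while (i + 1 < sorted.length && sorted[i + 1][0] < end) {
                end = Math.max(end, sorted[i + 1][1]);
                i++;
            }
            out.append(text, cursor, start);                       // untouched text
            out.append(pre).append(text, start, end).append(post); // highlighted text
            cursor = end;
            i++;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

The work is proportional to the number of matching terms rather than the number of tokens in the document. Note that the offsets alone say nothing about whether a term took part in a phrase or span match.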
- Mark
Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784 ]
Grant Ingersoll commented on LUCENE-794:
----------------------------------------
How should this relate to LUCENE-663? Seems like that one also covers other
kinds of queries? I'm no expert in highlighting, but it seems like there are at
least 3 different issues in JIRA for enabling things like phrase queries, etc.
Should we try to consolidate these?
Extend contrib Highlighter to properly support phrase queries and span queries
------------------------------------------------------------------------------
Key: LUCENE-794
URL: https://issues.apache.org/jira/browse/LUCENE-794
Project: Lucene - Java
Issue Type: Improvement
Components: Other
Reporter: Mark Miller
Priority: Minor
Attachments: spanhighlighter.patch, spanhighlighter10.patch,
spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch,
spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch,
spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch,
spanhighlighter_patch_4.zip
This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package
that scores just like QueryScorer, but scores a 0 for Terms that did not cause
the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys
and PhraseQuery. There is also a new Fragmenter that attempts to fragment
without breaking up Spans.
See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
There is a dependency on MemoryIndex.