On Jan 14, 2008, at 4:49 PM, Mark Miller wrote:
While the overall framework of LUCENE-663 appears similar to the
current contrib Highlighter, the code is actually quite different
and I do not think it handles as many corner cases in its current
state. LUCENE-663 supports PhraseQueries by implementing 'special'
search logic that inspects positional information to make sure the
Tokens from a PhraseQuery occur in order. I am not sure how closely
this logic matches Lucene's own PhraseQuery search logic, but a
cursory look makes me think it's not complete. It almost looks to me
as if it only handles in-order matching with simple slop (not edit
distance)... I am too lazy to check further, though, and I may have
missed something. Also, LUCENE-663 does not support SpanQueries.
This patch differs in that it fits the current Highlighter framework
without modifying it, and it uses Lucene's own internal search logic
to identify Spans for highlighting. PhraseQueries are handled by a
SpanQuery approximation.
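Roughly, that approximation amounts to mapping a phrase's terms and slop onto an ordered span-near query. A minimal sketch of the mapping, using invented stand-in types rather than Lucene's actual PhraseQuery and SpanNearQuery classes; note that span-near slop is not the edit-distance slop PhraseQuery uses, which is why this can only ever be an approximation:

```java
import java.util.List;

// Hypothetical stand-ins for the Lucene query classes, just to show the
// shape of the mapping; the real patch would build an
// org.apache.lucene.search.spans.SpanNearQuery from the PhraseQuery's terms.
record Phrase(List<String> terms, int slop) {}
record SpanNear(List<String> clauses, int slop, boolean inOrder) {}

class PhraseToSpan {
    // Approximate a phrase query with an in-order span-near query:
    // same terms, same slop value, ordered matching. The slop semantics
    // differ (position window vs. edit distance), so this is approximate.
    static SpanNear approximate(Phrase p) {
        return new SpanNear(p.terms(), p.slop(), true);
    }
}
```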
As far as PhraseQuery/SpanQuery highlighting goes, I don't think any
of the other Highlighter packages offer much. I think that things could
be done a little faster, but that would require abandoning the
current framework, and with all of the corner cases it now supports,
I'd hate to see that.
The other Highlighter code that is worth consideration is
LUCENE-644. It does abandon the current Highlighter framework and
takes an approach that is much more efficient for larger documents:
instead of spinning through all of the document tokens and comparing
them to query tokens, LUCENE-644 just looks at the tokens from the
query and grabs the original text using the offsets from those
tokens. This is darn fast, but it doesn't work well with positional
highlighting, and I wonder how well it supports all
of the corner cases that arise with overlapping tokens and whatnot.
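To make the offset-driven approach concrete, here is a toy sketch (all names invented, no actual Lucene or LUCENE-644 APIs): given character offsets already recorded for the query's matched terms, the original text can be decorated directly, with no second pass over the document's token stream:

```java
import java.util.List;

// Illustration only: wrap pre-recorded character ranges of the original
// text in <B> tags. In the real patch the offsets would come from stored
// term vectors for the query's terms.
class OffsetHighlighter {
    record Offset(int start, int end) {}

    // Offsets are assumed sorted by start and non-overlapping; handling
    // overlapping tokens is exactly the corner case questioned above.
    static String highlight(String text, List<Offset> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Offset o : hits) {
            out.append(text, last, o.start())      // untouched prefix
               .append("<B>")
               .append(text, o.start(), o.end())   // the matched term
               .append("</B>");
            last = o.end();
        }
        return out.append(text.substring(last)).toString();
    }
}
```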
Hmm, I'm beginning to think that the performance issue may be overcome
to some extent with the new TermVectorMapper stuff. The basic idea is
that you construct a highlighter that does the appropriate
highlighting as the term vector is being loaded from disk, through the
map function. This would save having to go back through all the tokens a
second time, but probably has other issues. It's just a thought in my
head at this point. At a minimum, I think the TVM could speed up the
TokenSources part that creates the TokenStream based on the TermVector.
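For illustration only, a mapper along these lines might look like the following sketch; the callback shape loosely imitates TermVectorMapper's map-per-term contract, but every name here is an invented stand-in, not actual Lucene code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the idea: collect highlight offsets while the term vector is
// being read from disk, so no second pass over the tokens is needed.
class HighlightingMapper {
    record Offset(int start, int end) {}

    private final Set<String> queryTerms;
    final List<Offset> hits = new ArrayList<>();

    HighlightingMapper(Set<String> queryTerms) {
        this.queryTerms = queryTerms;
    }

    // Invoked once per term as the term vector is loaded; if the term is
    // part of the query, remember where in the text it appeared.
    void map(String term, int freq, List<Offset> offsets) {
        if (queryTerms.contains(term)) {
            hits.addAll(offsets);
        }
    }
}
```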
At any rate, I am going to think some more on it.
-Grant