I just meant whether it would live in a Lucene release (somewhere
under contrib/) or just in JIRA. Would including the functionality
in Solr help get it into Lucene?
-Mike
On 29-Aug-07, at 4:58 AM, Mark Miller wrote:
It kind of is a contrib -- it's really just a new Scorer class (with
some auxiliary helper classes) for the old contrib Highlighter.
Since the contrib Highlighter is pretty hardened at this point, I
figured that was the best way to go. Or do you mean something
different?
- Mark
Mike Klaas wrote:
Mark,
I'm still interested in integrating this into Solr--this is a
feature that has been requested a few times. It would be easier
to do so if it were a contrib/...
thanks for the great work,
-Mike
On 27-Aug-07, at 4:21 AM, Mark Miller wrote:
I am a bit unclear about your question. The patch you mention
extends the original Highlighter to support phrase and span
queries. It does not include any major performance increases over
the original Highlighter (in fact, it takes a bit longer to
Highlight a Span or Phrase query than it does to just highlight
Terms).
Will it be released with the next version of Lucene? Doesn't look
like it, but anything is possible. A few people are using it, but
there has not been widespread interest that I have seen. My guess
is that there are just not enough people trying to highlight Span
queries -- which I'd blame on a lack of Span support in the
default Lucene Query syntax.
Whether it is included soon or not, the code works well and I
will continue to support it.
- Mark
Michael Stoppelman wrote:
Is this jar going to be in the next release of Lucene? Also, are
these the same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
-M
On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:
I have not looked at any highlighting code yet. Is there already an
extension of PhraseQuery that has getSpans()?
Currently I am using this code originally by M. Harwood:
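    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Rewrite a PhraseQuery as an unordered SpanNearQuery over its terms.
    PhraseQuery phraseQuery = (PhraseQuery) query;
    Term[] phraseQueryTerms = phraseQuery.getTerms();
    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }
    // inOrder = false; slop and boost are carried over from the phrase query.
    SpanNearQuery sp = new SpanNearQuery(clauses, phraseQuery.getSlop(), false);
    sp.setBoost(query.getBoost());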
I don't think it is perfect logic for PhraseQuery's edit distance,
but it approximates extremely well in most cases.
I wonder if this approach to Highlighting would be worth it in the
end. Certainly, it would seem to require that you store offsets or
you would have to re-tokenize anyway.
Some more interesting "stuff" on the current Highlighter methods:
We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks.
Ronnie's Highlighter appears to be faster than the original due to
two things: he doesn't have to re-tokenize text, and he rebuilds the
original document in large pieces. Depending on how you want to look
at it, he loses most of the speed gained from looking only at the
Query tokens instead of all tokens to pulling the Term offset
information (which appears pretty slow).
If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
actually match the speed of Ronnie's highlighter with the current
highlighter if you just rebuild the highlighted documents in bigger
pieces, i.e. instead of going through each token and adding the
source text that it covers, build up the offset information until
you get another hit and then pull from the source text into the
highlighted text in one big piece rather than a token's worth at a
time. Of course, this is not compatible with the way the Fragmenter
currently works. If you use the StandardAnalyzer instead of
SimpleAnalyzer, Ronnie's highlighter wins because it takes so darn
long to re-analyze.
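The "bigger pieces" rebuild described above can be sketched without
any Lucene classes. This is only an illustration under stated
assumptions: tokens are already available as non-overlapping,
in-order (start, end, hit) character offsets into the source text.
The Tok class, the method names, and the <B> tags are hypothetical
and are not part of the contrib Highlighter. Fragmenting is ignored
entirely, which is exactly the incompatibility mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedRebuild {

    /** Hypothetical token: character offsets into the source text plus a hit flag. */
    static final class Tok {
        final int start;
        final int end;
        final boolean hit;

        Tok(int start, int end, boolean hit) {
            this.start = start;
            this.end = end;
            this.hit = hit;
        }
    }

    /** Per-token rebuild: copies the gap and the token text for every token. */
    static String perToken(String text, List<Tok> tokens) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Tok t : tokens) {
            out.append(text, last, t.start);   // gap before this token
            if (t.hit) out.append("<B>");
            out.append(text, t.start, t.end);  // the token itself
            if (t.hit) out.append("</B>");
            last = t.end;
        }
        out.append(text, last, text.length()); // trailing text after the last token
        return out.toString();
    }

    /** Chunked rebuild: copies from the source text only when a hit is reached. */
    static String chunked(String text, List<Tok> tokens) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Tok t : tokens) {
            if (!t.hit) continue;              // non-hit tokens cost no copy at all
            out.append(text, last, t.start);   // everything since the previous hit, in one copy
            out.append("<B>").append(text, t.start, t.end).append("</B>");
            last = t.end;
        }
        out.append(text, last, text.length()); // trailing text after the last hit
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        List<Tok> tokens = new ArrayList<>();
        int pos = 0;
        for (String w : text.split(" ")) {
            boolean hit = w.equals("fox") || w.equals("lazy");
            tokens.add(new Tok(pos, pos + w.length(), hit));
            pos += w.length() + 1;             // skip the single space separator
        }
        System.out.println(perToken(text, tokens));
        System.out.println(chunked(text, tokens));
    }
}
```

Both methods produce the same highlighted string. The chunked version
makes one copy from the source text per hit, plus one for the tail,
instead of two per token, so the saving grows with the number of
non-hit tokens between hits.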
It is also interesting to note that it is very difficult to see a
gain from using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens long just to be
as fast as re-analyzing. Notice I didn't say fast, but "as fast".
With anything smaller, or if you're using a simpler analyzer,
TokenSources is certainly not worth it. It just takes too long to
pull TermVector info.
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]