Re: Highlighter that works with phrase and span queries

Mike Klaas Wed, 29 Aug 2007 10:29:11 -0700

I just meant whether it would live in a Lucene release (somewhere under contrib/) or just in JIRA. Would including the functionality in Solr help get it into Lucene?


-Mike


On 29-Aug-07, at 4:58 AM, Mark Miller wrote:

It kind of is a contrib -- it's really just a new Scorer class (with some auxiliary helper classes) for the old contrib Highlighter. Since the contrib Highlighter is pretty hardened at this point, I figured that was the best way to go. Or do you mean something different?
- Mark

Mike Klaas wrote:
Mark,
I'm still interested in integrating this into Solr -- this is a feature that has been requested a few times. It would be easier to do so if it were a contrib/...
thanks for the great work,
-Mike

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:
I am a bit unclear about your question. The patch you mention extends the original Highlighter to support phrase and span queries. It does not include any major performance increases over the original Highlighter (in fact, it takes a bit longer to highlight a Span or Phrase query than it does to just highlight Terms).
Will it be released with the next version of Lucene? Doesn't look like it, but anything is possible. A few people are using it, but there has not been widespread interest that I have seen. My guess is that there are just not enough people trying to highlight Span queries -- which I'd blame on a lack of Span support in the default Lucene Query syntax.
Whether it is included soon or not, the code works well and I will continue to support it.
- Mark

Michael Stoppelman wrote:
Is this jar going to be in the next release of Lucene? Also, are these the same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:
I have not looked at any highlighting code yet. Is there already an extension of PhraseQuery that has getSpans()?
Currently I am using this code originally by M. Harwood:
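    // Rewrite a PhraseQuery as an unordered SpanNearQuery with the same slop.
    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();
    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        // One SpanTermQuery per phrase term, in phrase order.
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }

    // inOrder = false: the terms may match in any order within the slop.
    SpanNearQuery sp = new SpanNearQuery(clauses,
            ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());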
I don't think it is perfect logic for PhraseQuery's edit distance, but it approximates extremely well in most cases.
I wonder if this approach to highlighting would be worth it in the end. Certainly, it would seem to require that you store offsets or you would have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current Highlighter if we grab from the source text in bigger chunks. Ronnie's Highlighter appears to be faster than the original due to two things: he doesn't have to re-tokenize text and he rebuilds the original document in large pieces. Depending on how you want to look at it, he loses most of the speed gained from just looking at the Query tokens instead of all tokens to pulling the Term offset information (which appears pretty slow).
If you use a SimpleAnalyzer on docs around 1800 tokens long, you can actually match the speed of Ronnie's highlighter with the current highlighter if you just rebuild the highlighted documents in bigger pieces, i.e. instead of going through each token and adding the source text that it covers, build up the offset information until you get another hit and then pull from the source text into the highlighted text in one big piece rather than a token's worth at a time. Of course this is not compatible with the way the Fragmenter currently works. If you use the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter wins because it takes so darn long to re-analyze.
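
For illustration, the "bigger pieces" idea might look roughly like the sketch below. It is not the contrib Highlighter's code and does not use any Lucene classes: the class name ChunkedHighlightSketch, the highlight() method, and the int[]{start, end} hit format are all made up for this example. It assumes the hit offsets are already known, sorted, and non-overlapping, and it ignores fragmenting entirely.

```java
/**
 * Sketch only: hypothetical names, not part of the Lucene Highlighter API.
 * Rebuilds highlighted text by copying the source in large runs between hits
 * instead of appending one token's worth of text at a time.
 */
final class ChunkedHighlightSketch {

    private ChunkedHighlightSketch() {}

    /**
     * @param text source text of the document
     * @param hits character ranges {start, end} (end exclusive) of matching
     *             tokens; assumed sorted by start and non-overlapping
     * @param pre  markup emitted before each hit, e.g. "<b>"
     * @param post markup emitted after each hit, e.g. "</b>"
     */
    static String highlight(String text, int[][] hits, String pre, String post) {
        StringBuilder out = new StringBuilder(
                text.length() + hits.length * (pre.length() + post.length()));
        int copiedUpTo = 0; // everything before this offset is already in 'out'
        for (int[] hit : hits) {
            int start = hit[0];
            int end = hit[1];
            // One bulk copy covers all non-matching text since the previous hit.
            out.append(text, copiedUpTo, start);
            out.append(pre).append(text, start, end).append(post);
            copiedUpTo = end;
        }
        // Copy the tail after the last hit in one piece as well.
        out.append(text, copiedUpTo, text.length());
        return out.toString();
    }
}
```

Calling ChunkedHighlightSketch.highlight("the quick brown fox", new int[][] {{4, 9}, {16, 19}}, "<b>", "</b>") wraps "quick" and "fox" and copies everything else in three bulk appends. The number of copies grows with the number of hits rather than the number of tokens, which is the saving described above.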
It is also interesting to note that it is very difficult to see a gain in using TokenSources to build a TokenStream. Using the StandardAnalyzer, it takes docs that are 1800 tokens just to be as fast as re-analyzing. Notice I didn't say fast, but "as fast". Anything smaller, or if you're using a simpler analyzer, and TokenSources is certainly not worth it. It just takes too long to pull TermVector info.
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
