Re: highlighting phrase query

Mark Miller Mon, 02 Jul 2007 14:12:40 -0700

There has been a lot of Highlighter discussion lately, but just to try and sum up the state of Highlighting in the Lucene world:

There are four Highlighter implementations that I know of. From what I can tell, only the original Contrib Highlighter has received sustained active development by more than one individual.


Contrib Highlighter:

The Contrib Highlighter supports the widest array of analyzers and corner cases and has had the widest exposure. It is generally slower on larger documents, both because you must re-analyze the text and because it supports a wider variety of use cases -- the TokenGroup for token overlaps and inspecting every term for fragmentation contribute to a huge performance drain on large documents. This highlighter does not support highlighting based on position, so all terms from the query will be highlighted in the text. You can avoid some of the cost of re-analyzing by using the TokenSources class to rebuild a TokenStream using stored offsets and/or positions, but this is unlikely to be faster unless you are using very large documents with a complex analyzer. Getting and sorting offsets/positions is relatively slow, and for smaller docs it is faster to just re-analyze.
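
To make the "all terms from the query will be highlighted" behavior concrete, here is a minimal plain-Java sketch. It does not use the Lucene API: the class name, the regex tokenizer standing in for an analyzer, and the <B> tags are all assumptions made for illustration. It re-tokenizes the text and wraps every token that appears in the query term set, with no regard to position.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy class, not part of Lucene.
final class TermHighlightSketch {
    // Stand-in for an analyzer: runs of word characters are tokens.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Wraps every token whose lowercased form is in queryTerms.
    // queryTerms is assumed to hold lowercase terms.
    static String highlightTerms(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text); // the text is tokenized again here
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()); // copy the gap before the token
            String token = m.group();
            if (queryTerms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
            last = m.end();
        }
        out.append(text, last, text.length()); // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("lucene", "java"));
        System.out.println(highlightTerms("Java code for Lucene Java", terms));
    }
}
```

With the terms of the phrase "lucene java", the leading "Java" is tagged as well as the actual phrase occurrence, which is the behavior described in the original question.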


LUCENE-403:

I have not spent a lot of time with this approach, but it is similar to the Contrib Highlighter approach. It almost certainly does not cover as many odd corner cases as Contrib Highlighter and the framework is lacking, but it does add some support for proper PhraseQuery highlighting by implementing some custom PhraseQuery search logic. Because LUCENE-403 is not as rigorous as the Contrib Highlighter, it may well be a bit faster. The author claims that HTML tags will not be broken when fragmenting.
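
The general idea of hand-written phrase matching can be sketched in plain Java as below. This is not the LUCENE-403 code and uses no Lucene classes; the class and method names are hypothetical, a regex stands in for the analyzer, and only exact phrases (no slop) are handled. Tokens are highlighted only where the phrase terms occur at consecutive positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy class, not part of Lucene or LUCENE-403.
final class PhraseHighlightSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // phrase terms are assumed to be lowercase and in query order.
    static String highlightPhrase(String text, String... phrase) {
        // Each entry is {startOffset, endOffset}; the list index is the position.
        List<int[]> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }

        StringBuilder out = new StringBuilder();
        int last = 0;
        int i = 0;
        while (i < tokens.size()) {
            if (matchesAt(text, tokens, i, phrase)) {
                int start = tokens.get(i)[0];
                int end = tokens.get(i + phrase.length - 1)[1];
                out.append(text, last, start);
                out.append("<B>").append(text, start, end).append("</B>");
                last = end;
                i += phrase.length; // continue after the matched phrase
            } else {
                i++;
            }
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // True if the phrase terms match the tokens at positions i, i+1, ...
    private static boolean matchesAt(String text, List<int[]> tokens, int i, String[] phrase) {
        if (phrase.length == 0 || i + phrase.length > tokens.size()) {
            return false;
        }
        for (int k = 0; k < phrase.length; k++) {
            int[] t = tokens.get(i + k);
            String tok = text.substring(t[0], t[1]).toLowerCase(Locale.ROOT);
            if (!tok.equals(phrase[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(highlightPhrase("Java code for Lucene Java", "lucene", "java"));
    }
}
```

Here the standalone "Java" at the start is left alone and only the adjacent "Lucene Java" is wrapped. The trade-off is that the matching rules of the query have to be duplicated in the highlighter.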


LUCENE-644:

This Highlighter approach requires that you have stored term offsets in the index. This Highlighter can be very fast if you are using a complicated analyzer since there is no need for re-analyzing the text (due to the stored offsets). Also, rather than scoring every term like the Contrib Highlighter, only terms from the query are effectively "handled". For smaller documents and simpler analyzers there is not much speed improvement over the Contrib Highlighter (due to the time it takes to retrieve and sort offsets), but for larger documents, especially with more complex analyzers, this Highlighter can be extremely fast. Again, positional highlighting for Phrase and Span queries is not supported.

The biggest reason this implementation performs so well is that it deals with the text in much bigger chunks. Contrib Highlighter can also avoid re-analyzing by storing offsets and positions, but then it scores the document and rebuilds the text one token at a time using the performance-draining TokenGroup (which helps cover some of those corner cases). This is very slow on very large documents.
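
A rough plain-Java sketch of the offset-driven idea follows. It is not the LUCENE-644 code and does not read a real index: the map of stored offsets, the class name, and the <B> tags are assumptions for illustration. Only the offsets of query terms are collected and sorted, and the untouched text between hits is copied in whole chunks instead of token by token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy class, not part of Lucene or LUCENE-644.
final class OffsetHighlightSketch {
    // storedOffsets maps a term to its {start, end} character offsets,
    // standing in for offsets stored in the index at indexing time.
    static String highlight(String text,
                            Map<String, List<int[]>> storedOffsets,
                            Set<String> queryTerms) {
        // Gather offsets for query terms only; other terms are never visited.
        List<int[]> hits = new ArrayList<>();
        for (String term : queryTerms) {
            List<int[]> offsets = storedOffsets.get(term);
            if (offsets != null) {
                hits.addAll(offsets);
            }
        }
        // Sorting is part of the fixed cost mentioned above.
        hits.sort(Comparator.comparingInt((int[] o) -> o[0]));

        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] hit : hits) {
            if (hit[0] < last) {
                continue; // skip a hit that overlaps one already emitted
            }
            out.append(text, last, hit[0]); // copy one big unhighlighted chunk
            out.append("<B>").append(text, hit[0], hit[1]).append("</B>");
            last = hit[1];
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene is a Java library";
        Map<String, List<int[]>> offsets = new HashMap<>();
        offsets.put("lucene", Collections.singletonList(new int[] {0, 6}));
        offsets.put("java", Collections.singletonList(new int[] {12, 16}));
        offsets.put("library", Collections.singletonList(new int[] {17, 24}));
        Set<String> query = new HashSet<>(Arrays.asList("lucene", "java"));
        System.out.println(highlight(text, offsets, query));
    }
}
```

Note that the text is never tokenized in this sketch, which is why the cost of the analyzer drops out and why large documents with few hits benefit the most.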


LUCENE-794:

This approach extends the Contrib Highlighter to support highlighting Span and Phrase queries. The approach used for non-position-sensitive Query clauses is the same as the Contrib Highlighter, and if you use the latest CachingTokenFilter the speed is roughly the same. Position-sensitive Query clauses are a bit slower, as a MemoryIndex is used to retrieve the correct positions to highlight. This gives exact highlighting without reimplementing search logic. Also, all of the use cases and corner cases that have been solved for the Contrib Highlighter are retained. All of the deficiencies of the Contrib Highlighter (slower on large docs) are also retained. The majority of the code for this comes from the Contrib Highlighter -- it uses the Contrib Highlighter framework. This points out a plus for the Contrib Highlighter setup -- it allows for an extension like this, while LUCENE-644 could not easily be expanded to handle position-sensitive queries.

There has been some discussion of getting Lucene to identify correct highlights as the search is processed. I am not very optimistic that this will be fruitful, but those who are discussing it know more about this than I do.


- Mark

sandeep chawla wrote:

Hi All,

I am developing a search tool using lucene. I am using lucene 2.1.

I have a requirement to highlight query words in the results.
Lucene-highlighter 2.1 doesn't work well in highlighting phrase queries.

For example - if I have a query string "lucene Java", it highlights
not only occurrences of "lucene java" but occurrences of lucene and
java too in the text.

I think, this is a known problem..is this issue solved in lucene 2.2.
well my application is almost complete and i really don't wanna switch
to lucene 2.2.

I was going through previous posts but i couldn't find a solution of
this problem. There r some alternate highlighters but it seems, they
r not stable and still in evolution phase.

I am looking for a standard n stable API for this purpose..

I'd appreciate any thoughts/guidance in this issue.

Thanks
Sandeep

