I intend to release a new version of the highlighter soon that should (hopefully)
address some of the issues under discussion.
The re-design will be based on the following principles:
* A TokenStream will be passed to the highlighter to provide the source of tokens. The
token stream could be provided by an analyzer re-tokenizing the document text or by
some future extension to Lucene that is capable of storing token offsets in the index.
This change decouples the highlighter from whatever index-time or query-time
tokenization choices people make, which they should be free to do.
* The highlighter already has an abstraction for the list of terms that need to be
highlighted - see TextHighlighter. The only change I plan here is to introduce the
notion of a WeightedTerm that associates a weight with each term to be highlighted in
order to influence selection of the best fragments. The QueryHighlightExtractor class
will be deprecated and will simply become a tool for extracting a list of terms from a
query so that they can be passed to the TextHighlighter class.
* I will make the fragmentation logic a pluggable class that can change the way the
highlighter decides to split documents into fragments. The current implementation
simply splits after "n" tokens. I will introduce a new DocSplitter interface to allow
alternative implementations to split documents up, e.g. by recognizing sentence
endings at the ?, ! and . characters. I don't plan to provide a sentence splitter at
this stage - too much work!
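
To make the last two points concrete, here is a rough sketch of how the pieces might
fit together. Only the names WeightedTerm and DocSplitter come from the plan above;
the method startsNewFragment, the classes FixedTokenCountSplitter, SentenceEndSplitter
and SplitterDemo, and the scoring rule are all hypothetical and only meant for
illustration. The sentence splitter also assumes tokens still carry their trailing
punctuation, which many analyzers would strip.

```java
// Hypothetical sketch only: apart from the names WeightedTerm and DocSplitter,
// every signature here is an assumption made for illustration.

/** A term to highlight, plus a weight that influences fragment selection. */
final class WeightedTerm {
    final String term;
    final float weight;

    WeightedTerm(String term, float weight) {
        this.term = term;
        this.weight = weight;
    }
}

/** Pluggable fragmentation logic: called once per token, in document order. */
interface DocSplitter {
    /** Returns true if the given token should begin a new fragment. */
    boolean startsNewFragment(String tokenText);
}

/** Starts a new fragment once the current one already holds maxTokens tokens. */
final class FixedTokenCountSplitter implements DocSplitter {
    private final int maxTokens;
    private int tokensInFragment = 0;

    FixedTokenCountSplitter(int maxTokens) {
        this.maxTokens = maxTokens;
    }

    public boolean startsNewFragment(String tokenText) {
        if (tokensInFragment >= maxTokens) {
            tokensInFragment = 1; // this token is the first of the new fragment
            return true;
        }
        tokensInFragment++;
        return false;
    }
}

/** Starts a new fragment after any token that ends with '.', '?' or '!'. */
final class SentenceEndSplitter implements DocSplitter {
    private boolean previousEndedSentence = false;

    public boolean startsNewFragment(String tokenText) {
        boolean start = previousEndedSentence;
        previousEndedSentence = !tokenText.isEmpty()
                && ".?!".indexOf(tokenText.charAt(tokenText.length() - 1)) >= 0;
        return start;
    }
}

final class SplitterDemo {
    /** Adds up the weight of every token occurrence that matches a weighted term. */
    static float score(String[] fragmentTokens, WeightedTerm[] terms) {
        float total = 0f;
        for (String token : fragmentTokens) {
            for (WeightedTerm wt : terms) {
                if (wt.term.equalsIgnoreCase(token)) {
                    total += wt.weight;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Naive whitespace tokenization, so punctuation stays attached to tokens.
        String[] tokens = "Lucene is fast. Is it? Yes it is!".split(" ");
        DocSplitter splitter = new SentenceEndSplitter();
        for (String token : tokens) {
            if (splitter.startsNewFragment(token)) {
                System.out.println(); // one fragment per output line
            }
            System.out.print(token + " ");
        }
        System.out.println();
    }
}
```

Swapping in new FixedTokenCountSplitter(n) would give the split-after-"n"-tokens
behaviour without touching the loop, which is the point of making the splitter
pluggable.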
Hopefully this provides an open framework for folks to do what they want with the
highlighter. Please let me know if you have any comments or suggestions.
As for ownership/support, we already had the vote on whether the highlighter should be
accepted as part of Lucene core, and it wasn't. I don't mind either way, but I would
like to make the above changes before anyone considers moving it to the sandbox or
wherever.
I'm going to be away for the next 5 days so there may be a delay in any work on this.
Cheers
Mark