I intend to release a new version of the highlighter soon that should (hopefully) 
address some of the issues under discussion.
The re-design will be based on the following principles:
 
* A TokenStream will be passed to the highlighter to provide the source of tokens. The 
token stream could be provided by an analyzer re-tokenizing the document text or by 
some future extension to Lucene that is capable of storing token offsets in the index. 
This change effectively abstracts the highlighter from the index-time or query-time 
tokenization choices people should be free to make.
 
* The highlighter already has an abstraction from the list of terms that are needed to 
be highlighted - see TextHighlighter. The only change I plan here is to introduce the 
notion of a WeightedTerm that associates a weight with each term to be highlighted in 
order to influence selection of the best fragments. The QueryHighlightExtractor class 
will be deprecated and will simply become a tool for extracting a list of terms from a 
query so that they can be passed to the TextHighlighter class.
 
* I will make the fragmentation logic a pluggable class that can change the way the 
highlighter decides to split documents into fragments. The current implementation 
simply splits after "n" tokens. I will introduce a new DocSplitter interface to allow 
alternative implementations to split documents up, eg based on recognizing end of 
sentences by the ?!. characters. I dont plan to provide a sentence splitter at this 
stage - too much work!
 
Hopefully this provides an open framework for folks to do what they want with the 
highlighter. Please let me have any comments if you have any suggestions.
 
As for ownership/support, we already had the vote on whether highlighter is accepted 
as part of Lucene core or not and it wasn't. I dont mind either way but I would like 
to make the above changes before anyone considers moving it to the sandbox or wherever.
 
I'm going to be away for the next 5 days so there may be a delay in any work on this.
 
Cheers
Mark

                
---------------------------------
 WIN FREE WORLDWIDE FLIGHTS - nominate a cafe in the Yahoo! Mail Internet Cafe Awards

Reply via email to