Grant Ingersoll wrote:
Hi,
I was browsing the term highlighting code in the sandbox and I noticed
the following comment for the getBestFragment method in the
Highlighter.java code:
/**
...
* @param tokenStream a stream of tokens identified in the text parameter,
* including offset information. This is typically produced by an analyzer
* re-parsing a document's text. Some work may be done on retrieving
* TokenStreams more efficiently by adding support for storing original text
* position data in the Lucene index but this support is not currently
* available (as of Lucene 1.4 rc2).
...
*/
It struck me that I might be able to contribute some time to make this
so, since I recently submitted a patch that offers just such an
enhancement to the term vectors.
I would like to implement this, but I don't really want to submit a
patch against another patch (It's hard enough managing all the changes
that come down). So, I was wondering if anyone (i.e. a committer) has
had a chance to look at the Term Vector offset patch and what their
thoughts are on it? I can see the performance improvements in the
highlighter that would come about by avoiding having to re-analyze the
text, plus you could highlight the whole field if you wanted to.
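To illustrate the idea of highlighting from stored offsets rather than
re-analyzing, here is a minimal sketch. It assumes the character offsets of
the matching terms have already been read from somewhere (for example, from
term vectors once they carry offsets). The class and method names
(OffsetHighlighter, Span, highlight) are hypothetical and are not part of
Lucene or of the sandbox highlighter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from Lucene.
final class OffsetHighlighter {

    /** A character span [start, end) of one matched term in the original text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps each span in <b>...</b> without tokenizing the text again. */
    static String highlight(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // end of the text already copied to the output
        for (Span s : sorted) {
            if (s.start < cursor || s.end > text.length() || s.start >= s.end) {
                continue; // skip overlapping, empty, or out-of-range spans
            }
            out.append(text, cursor, s.start);
            out.append("<b>").append(text, s.start, s.end).append("</b>");
            cursor = s.end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene stores term vectors";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(14, 18)); // offsets of "term"
        System.out.println(highlight(text, spans));
    }
}
```

Because the spans index directly into the stored field text, the whole field
can be marked up in one pass, which is the "highlight the whole field" case
mentioned above.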
Hi Grant,
I will try to look into your latest code by the end of September, but I
probably won't find time earlier. I am using the current TermVectors very
successfully.
Thanks for the excellent code.
Your new patch provides the ability to store positions and token offsets,
doesn't it?
As far as I remember, there is also Bernhard's patch for making TermVectors
more efficient in case of multiple threads using one IndexReader, and there
are the API changes from Daniel that might influence your patch too. Is all
this in sync?
Also, if I make this change, do the committers suggest I keep the
current ability to analyze and have this as an alternative, or would it
be safe to assume this is only used when offset info is stored?
Storing the offsets will increase index size considerably, so one will not
always want to do that. I guess highlighting should continue to work with
re-analysis. However, I know that this makes the code much more complex: you
always have to maintain two versions of the highlighter ...
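One way to keep both paths without duplicating the whole highlighter might be
to hide the source of the offsets behind a small interface and fall back to
re-analysis when nothing is stored. The sketch below is only an illustration
under that assumption: FallbackOffsets, OffsetSource, REANALYZE and resolve
are hypothetical names, and the whitespace splitter merely stands in for a
real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: these types are illustrative and are not Lucene classes.
final class FallbackOffsets {

    /** Returns {start, end} pairs for a term, or null if no offsets are available. */
    interface OffsetSource {
        List<int[]> offsetsFor(String text, String term);
    }

    /** Stand-in for re-analysis: splits on whitespace, lowercases, tracks offsets. */
    static final OffsetSource REANALYZE = (text, term) -> {
        List<int[]> found = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < text.length()) {
            // skip whitespace, then consume one token
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start
                    && text.substring(start, i).toLowerCase(Locale.ROOT).equals(wanted)) {
                found.add(new int[] {start, i});
            }
        }
        return found;
    };

    /** Uses stored offsets when the source has them, otherwise re-analyzes the text. */
    static List<int[]> resolve(OffsetSource stored, String text, String term) {
        List<int[]> fromIndex = (stored == null) ? null : stored.offsetsFor(text, term);
        return (fromIndex != null) ? fromIndex : REANALYZE.offsetsFor(text, term);
    }
}
```

With this shape the fragment scoring and markup code would only ever see
offset pairs, so the choice between stored offsets and re-analysis stays in
one place.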
What do others think?
Christoph