Re: Term highlighting and Term vector patch

Christoph Goller Thu, 16 Sep 2004 02:05:21 -0700

Grant Ingersoll wrote:

Hi,
I was browsing the term highlighting code in the sandbox and I noticed
the following comment for the getBestFragment method in the
Highlighter.java code:
/** ...
 * @param tokenStream a stream of tokens identified in the text parameter, including offset information.
 * This is typically produced by an analyzer re-parsing a document's
 * text. Some work may be done on retrieving TokenStreams more efficently
 * by adding support for storing original text position data in the Lucene
 * index but this support is not currently available (as of Lucene 1.4 rc2).
 * ...
 */
which made me think that I might be able to contribute some more time to
make this so, since I recently submitted a patch that adds just such an
enhancement to the term vectors.
I would like to implement this, but I don't really want to submit a
patch against another patch (It's hard enough managing all the changes
that come down).  So, I was wondering if anyone (i.e. a committer) has
had a chance to look at the Term Vector offset patch and what their
thoughts are on it?  I can see the performance improvements in the
highlighter that would come about by avoiding having to re-analyze the
text, plus you could highlight the whole field if you wanted to.


Hi Grant,

I will try to look at your latest code by the end of September, but I probably
won't find time before then. I am using the current TermVectors very successfully.
Thanks for the excellent code.

Your new patch provides the ability to store positions and token offsets,
doesn't it?

As far as I remember, there is also Bernhard's patch for making TermVectors
more efficient in case of multiple threads using one IndexReader, and there
are the API changes from Daniel that might influence your patch too. Is all
this in sync?

Also, if I make this change, do the committers suggest I keep the
current ability to analyze and have this as an alternative, or would it
be safe to assume this is only used when offset info is stored?


Storing the offsets will increase index size considerably, so one will not
always want to do that. I guess highlighting should continue to work with
reanalyzing. However, I know that this makes the code much more complex: you
always have to maintain two versions of the highlighter.
What do others think?
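
One way the duplication might be limited, as a rough sketch only: keep a
single highlighting routine and vary only where the offsets come from. All
names below (OffsetSource, StoredSource, ReanalyzingSource, choose) are
hypothetical and are not part of Lucene or of the patch; the tokenizer is a
naive stand-in for a real analyzer, and offsets are assumed to be sorted and
non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class OffsetSourceSketch {
    /** One term occurrence with character offsets into the original text. */
    static final class TermOffset {
        final String term;
        final int start;
        final int end;
        TermOffset(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical abstraction: where the highlighter gets its offsets from. */
    interface OffsetSource {
        List<TermOffset> offsets(String text);
    }

    /** Fallback path: re-analyze the text (naive letter/digit tokenizer, lowercased). */
    static final class ReanalyzingSource implements OffsetSource {
        public List<TermOffset> offsets(String text) {
            List<TermOffset> out = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                // Skip separators, then consume one token.
                while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) i++;
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
                if (i > start) {
                    out.add(new TermOffset(text.substring(start, i).toLowerCase(), start, i));
                }
            }
            return out;
        }
    }

    /** Fast path: offsets that were stored at index time (assumed sorted by start). */
    static final class StoredSource implements OffsetSource {
        private final List<TermOffset> stored;
        StoredSource(List<TermOffset> stored) { this.stored = stored; }
        public List<TermOffset> offsets(String text) { return stored; }
    }

    /** Use stored offsets when the index has them, otherwise re-analyze. */
    static OffsetSource choose(List<TermOffset> storedOrNull) {
        return storedOrNull != null ? new StoredSource(storedOrNull) : new ReanalyzingSource();
    }

    /** The single highlighting routine; it does not care which source is used. */
    static String highlight(String text, Set<String> queryTerms, OffsetSource source) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (TermOffset t : source.offsets(text)) {
            if (!queryTerms.contains(t.term)) continue;
            sb.append(text, last, t.start)
              .append("<B>").append(text, t.start, t.end).append("</B>");
            last = t.end;
        }
        return sb.append(text, last, text.length()).toString();
    }
}
```

With this shape only the offset sources differ; whether that is enough to
avoid two highlighter versions in practice is an open question.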

Christoph


