Mike Klaas wrote:

On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote:


: What I would LOVE is if I could do it in a standard Lucene search like I
: mentioned earlier.
: Hit.doc[0].getHitTokenList() :confused:
: Something like this...

The Query/Scorer APIs don't provide any mechanism for information like that to be conveyed back up the call chain -- mainly because it's more
heavyweight than most people need.

If you have custom Query/Scorer implementations, you can keep track of whatever state you want when executing a Query. In fact, the SpanQuery family of queries does keep track of exactly the type of info you seem to want, and after executing a query, you can ask it for the "Spans" of any matching document. The downside is a loss in query execution performance (because it takes time/memory to keep track of all the matches).

Even then, if I'm not mistaken, spans track token _positions_, not _offsets_ in the original string.

That's correct.

An inverted text index like Lucene's is fast precisely because it doesn't have to keep track of this information.

One option is to stuff the offsets into payloads, and then write a custom Query that decodes the offsets from each payload and stores them away when collecting hits.
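As a rough sketch of the encoding half of that idea: a payload is just an opaque byte array attached to a term position, so the start and end offsets could be packed into 8 bytes at index time and unpacked at search time. The class and method names below (OffsetPayloadSketch, encode, decode) are hypothetical and are not part of Lucene; wiring this into a TokenFilter and a custom Query is left out.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, NOT part of Lucene: packs a token's character offsets
// into 8 bytes so they could be carried as an opaque per-position payload.
final class OffsetPayloadSketch {

    /** Encodes the start and end character offsets as two big-endian ints. */
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Decodes a payload written by encode(); returns {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            throw new IllegalArgumentException("expected an 8-byte offset payload");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

A custom Scorer would call something like decode() on the payload of each matching position and stash the result alongside the hit, which is where the extra time and memory goes.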

I think the best alternative might be to use term vectors, which are essentially a cache of the analyzed tokens for a document.

Another way to think of term vectors is as a single-document inverted index that you can retrieve in its entirety. I.e., it maps terms to their occurrences (count, positions, offsets) within the document.
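To make the "single-document inverted index" picture concrete, here is a minimal sketch in plain Java. It is not Lucene's term vector API: TermVectorSketch, Occurrence, and build are made-up names, and the tokenizer is a naive stand-in for a real Analyzer (lowercase, split on anything that is not a letter or digit). It also shows the position/offset distinction mentioned earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: this is NOT Lucene's term vector API.
final class TermVectorSketch {

    /** One occurrence of a term within the document. */
    static final class Occurrence {
        final int position;     // ordinal of the token in the token stream
        final int startOffset;  // index of the token's first char in the original text
        final int endOffset;    // index one past the token's last char

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Maps each term to its occurrences; the list size is the term's count. */
    static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;            // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;            // consume one token
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            vector.computeIfAbsent(term, k -> new ArrayList<>())
                  .add(new Occurrence(position++, start, i));
        }
        return vector;
    }
}
```

Given the terms of a matching query, looking them up in such a map yields the character ranges to report back, without re-analyzing the text.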

I agree, term vectors should work for this.

I don't really understand, though, why the highlighter package doesn't work here -- it also just re-analyzes the text when it can't find term vectors.

Mike
