Re: Getting tokens from search results. Simple concept

Michael McCandless Sat, 07 Mar 2009 05:47:46 -0800


Mike Klaas wrote:

On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote:
: What I would LOVE is if I could do it in a standard Lucene search like I
: mentioned earlier.
: Hit.doc[0].getHitTokenList() :confused:
: Something like this...
The Query/Scorer APIs don't provide any mechanism for information like that to be conveyed back up the call chain -- mainly because it's more heavyweight than most people need.
If you have custom Query/Scorer implementations, you can keep track of whatever state you want when executing a Query -- in fact the SpanQuery family of queries do keep track of exactly the type of info you seem to want, and after executing a query, you can ask it for the "Spans" of any matching document -- the downside is a loss in performance of query execution (because it takes time/memory to keep track of all the matches).
Even then, if I'm not mistaken, spans track token _positions_, not _offsets_ in the original string.


That's correct.

A reverse text index like Lucene is fast precisely because it doesn't have to keep track of this information.

One option is to stuff the offsets into payloads, and then make a custom Query that decodes the offsets from the payload and stores them away when collecting hits.
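As a rough illustration of the payload idea, the sketch below packs a token's start and end character offsets into a small byte array and unpacks them again. It is plain Java and deliberately does not touch the Lucene payload or Query APIs; the class name OffsetPayloadSketch and the 8-byte layout (two big-endian ints) are assumptions made for this example only.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: shows only the byte packing a payload could carry,
// not how Lucene attaches payloads to tokens or reads them back at search time.
public class OffsetPayloadSketch {

    /** Packs start and end character offsets into 8 bytes (two big-endian ints). */
    public static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Unpacks the bytes produced by encode() into {startOffset, endOffset}. */
    public static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int start = buf.getInt();
        int end = buf.getInt();
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        byte[] payload = encode(28, 31);
        int[] offsets = decode(payload);
        System.out.println("start=" + offsets[0] + " end=" + offsets[1]);
    }
}
```

A custom Query would call something like decode() for each matching term occurrence and keep the results alongside the hit, which is where the extra time and memory cost comes from.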

I think the best alternative might be to use term vectors, which are essentially a cache of the analyzed tokens for a document.

Another way to think of term vectors is a single-document inverted index that you can retrieve in its entirety. I.e., it maps terms to their occurrences (count, positions, offsets) within the document.
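To make the "single-document inverted index" picture concrete, here is a minimal sketch in plain Java. It is not Lucene's term vector API: the class TermVectorSketch, the nested Occurrence type, and the simplistic tokenizer (lowercase, split on anything that is not a letter or digit) are all assumptions for illustration. It also shows the difference between a token position and a character offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a term vector: term -> list of occurrences in ONE document.
public class TermVectorSketch {

    /** One occurrence: token position (0-based) plus character offsets [start, end). */
    public static final class Occurrence {
        public final int position;
        public final int startOffset;
        public final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " [" + startOffset + "," + endOffset + ")";
        }
    }

    /** Tokenizes the text and records every occurrence of every term. */
    public static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip separator characters between tokens.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                vector.computeIfAbsent(term, k -> new ArrayList<>())
                      .add(new Occurrence(position, start, i));
                position++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String doc = "The quick fox saw the other fox";
        build(doc).forEach((term, occs) ->
            System.out.println(term + " count=" + occs.size() + " " + occs));
    }
}
```

Given the terms that matched a query, looking them up in such a map yields the offsets directly, with no need to re-analyze the original text.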


I agree, term vectors should work for this.

I don't really understand, though, why the highlighter package doesn't work here -- it also just re-analyzes the text, when it can't find term vectors.


Mike

