Mike Klaas wrote:
On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote:
: What I would LOVE is if I could do it in a standard Lucene search like I
: mentioned earlier.
: Hit.doc[0].getHitTokenList() :confused:
: Something like this...
The Query/Scorer APIs don't provide any mechanism for information like
that to be conveyed back up the call chain -- mainly because it's more
heavyweight than most people need.
If you have custom Query/Scorer implementations, you can keep track of
whatever state you want when executing a Query -- in fact, the SpanQuery
family of queries does keep track of exactly the type of info you seem to
want, and after executing a query, you can ask it for the "Spans" of any
matching document. The downside is a loss in query execution performance
(because it takes time/memory to keep track of all the matches).
Even then, if I'm not mistaken, spans track token _positions_, not
_offsets_ in the original string.
That's correct.
A reverse text index like Lucene is fast precisely because it
doesn't have to keep track of this information.
One option is to stuff the offsets into payloads, and then make a
custom Query that decodes the offsets from the payload and stores them
away when collecting hits.
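To make the payload idea a bit more concrete, here is a minimal sketch in plain Java of just the encode/decode half. It deliberately uses no Lucene classes: the OffsetPayload name and the fixed 8-byte layout (two big-endian ints) are assumptions for illustration, not anything Lucene defines. Attaching the bytes to a token at index time and reading them back inside a custom Query would still have to be written against the real payload API.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not part of Lucene): packs a token's start and end
// character offsets into 8 bytes that could be stored as a payload.
final class OffsetPayload {
    private OffsetPayload() {}

    // Encode the two offsets as big-endian ints (ByteBuffer's default order).
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    // Decode the payload bytes back into {startOffset, endOffset}.
    static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

A variable-length encoding would likely be smaller on disk; the fixed layout is used here only to keep the sketch short.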
I think the best alternative might be to use term vectors, which are
essentially a cache of the analyzed tokens for a document.
Another way to think of term vectors is as a single-document inverted
index that you can retrieve in its entirety. I.e., it maps terms to their
occurrences (count, positions, offsets) within the document.
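As a rough illustration of that "single-document inverted index" view, the sketch below builds a toy term vector in plain Java. It is not Lucene's term vector API: ToyTermVector, Occurrence, and the crude tokenizer (lowercase, split on anything that is not a letter or digit) are all assumptions made for the example. It also shows the position/offset distinction from earlier in the thread: a position is the token's index in the token stream, while offsets are character indexes into the original string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a term vector (hypothetical names, not Lucene classes):
// maps each term of ONE document to all of its occurrences.
final class ToyTermVector {
    // One occurrence of a term in the document.
    static final class Occurrence {
        final int position;    // index of the token in the token stream
        final int startOffset; // first character of the token in the text
        final int endOffset;   // one past the last character of the token

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private ToyTermVector() {}

    // Stand-in "analyzer": lowercases and splits on non-alphanumerics.
    // The count for a term is simply the size of its occurrence list.
    static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip separator characters.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                vector.computeIfAbsent(term, k -> new ArrayList<>())
                      .add(new Occurrence(position++, start, i));
            }
        }
        return vector;
    }
}
```

For the text "The quick fox; the end", the term "the" maps to two occurrences: position 0 with offsets 0-3, and position 3 with offsets 15-18.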
I agree, term vectors should work for this.
I don't really understand, though, why the highlighter package doesn't
work here -- it also just re-analyzes the text, when it can't find
term vectors.
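For reference, the re-analysis fallback amounts to something like the sketch below: run the stored text through the analyzer again, and mark every token whose term matches a query term. This is plain Java, not the actual Highlighter code. ReanalyzingHighlighter, the <b> markers, and the simple tokenizer are assumptions for illustration, and a real analyzer (stemming, stop words) would make the matching less trivial.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of highlighting by re-analysis (not Lucene's Highlighter).
final class ReanalyzingHighlighter {
    private ReanalyzingHighlighter() {}

    // Re-tokenizes the original text and wraps tokens whose lowercased form
    // is in queryTerms with <b>...</b>. queryTerms must already be lowercase.
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int start = i;
                // Consume one token; start and i are its character offsets.
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String token = text.substring(start, i);
                if (queryTerms.contains(token.toLowerCase(Locale.ROOT))) {
                    out.append("<b>").append(token).append("</b>");
                } else {
                    out.append(token);
                }
            } else {
                // Copy separator characters through unchanged.
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

Note that this only knows which terms appear in the query, not which occurrences actually caused the hit, which may be why it falls short of what was asked for.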
Mike