rolaren...@earthlink.net wrote:


Now I notice (from googling) that I can also downcast TermFreqVector
to TermPositionVector, which contains the offsets (which I will need).

So -- under what conditions would that cast fail?

The cast fails if you indexed the field with Field.TermVector.YES, which stores neither position nor offset information. If you always
index the field with TermVector.WITH_OFFSETS, WITH_POSITIONS or
WITH_POSITIONS_OFFSETS, the cast will always succeed.

OK, cool.

I see in the javadocs for TermPositionVector that it "not necessarily contains both positions and offsets, but at least one of these arrays exists". I think it works like this:

TermVector.WITH_OFFSETS => TermVectorOffsetInfo[] always exists (so far, works for me)
TermVector.WITH_POSITIONS => positions int[] always exists
TermVector.WITH_POSITIONS_OFFSETS => both arrays always exist

Right.

Right? And I guess the reason for using TermVector.WITH_POSITIONS => positions int[] is that it has a smaller memory footprint?


Well, first: it's storing something different. Position is (by default) the term count, i.e. the first term is position 0, the next is position 1, etc., whereas the start/end offsets are normally the character locations where each term started and ended. Both are computed during analysis and stored in the index.
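To make the position/offset distinction concrete, here is a minimal sketch in plain Java. It is not Lucene code: the PositionsVsOffsets class, its Token type and the tokenize method are hypothetical names made up for illustration, and it assumes a simple whitespace tokenizer with an exclusive end offset, which may differ from what your analyzer does.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for an analyzer; hypothetical names, not the Lucene API.
class PositionsVsOffsets {

    /** One term occurrence with its position (term count) and character offsets. */
    static final class Token {
        final String term;
        final int position;    // 0 for the first term, 1 for the next, ...
        final int startOffset; // index of the term's first character
        final int endOffset;   // index one past the term's last character (assumption)

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits text on whitespace, recording a position and offsets for each term. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between terms; it advances offsets but not positions.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the term itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("quick  brown fox")) {
            System.out.println(t.term + " position=" + t.position
                    + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

For the input "quick  brown fox" (two spaces after "quick"), the positions are 0, 1, 2 regardless of spacing, while the offsets are [0,5), [7,12) and [13,16). So the two arrays carry different information: positions count terms, offsets point back into the original characters.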

Storing only positions gives a smaller index size than only offsets or positions plus offsets.

The memory difference is typically a non-issue, since an app normally doesn't keep these instances around for long. I.e., during a search request you normally pull them from the index, do something interesting, and let them go.

Mike
