rolaren...@earthlink.net wrote:
Now I notice (from googling) that I can also downcast TermFreqVector
to TermPositionVector, which contains the offsets (which I will
need).
So -- under what conditions would that cast fail?
The cast fails if you had indexed the field with
Field.TermVector.YES,
which does not store positions nor offsets information. If you
always
index the field with TermVector.WITH_OFFSET, WITH_POSITIONS or
WITH_POSITIONS_OFFSETS, the cast will always succeed.
OK, cool.
I see in the javadocs for TermPositionVector that it "not
necessarily contains both positions and offsets, but at least one of
these arrays exists"; does it work like this, I think:
TermVector.WITH_OFFSETS => TermVectorOffsetInfo[] always exists (so
far, works for me)
TermVector.WITH_POSITIONS => positions int[] always exists
TermVector.WITH_POSITIONS_OFFSETS => both arrays always exist
Right.
Right? And I guess the reason for using TermVector.WITH_POSITIONS =>
positions int[] is that it has a smaller memory footprint?
Well, first: it's storing something different. Position is (by
default) the term count, ie first term is position 0, next is position
1, etc. Whereas start/end offset are normally the character locations
where each term started and ended. These are computed during analysis
and stored into the index.
Storing only positions gives a smaller index size than only offsets or
positions plus offsets.
The memory difference is typically a non-issue since an app normally
doesn't store these instances around for a long time. Ie normally you
pull them from the index, do something interesting, and let them go,
during a search request.
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org