Marvin Humphrey wrote:
Moving away from cached norms was the second of three major changes to
the file format on my agenda, and the one I was all but certain I
wouldn't be able to sell to the Lucene community. The first was using
bytecounts at the head of Strings.
The third was storing start offsets and end offsets in the ProxFile.
It rankles that much of the information from tis/frq/prx gets
duplicated in the term vector files, but highlighting is most efficient
when you know the offsets, and the primary index stops short of storing
that information. Currently, we have this:
ProxFile (.prx) --> <TermPositions>^TermCount
How about this?
ProxFile (.prx) --> <TermPositions,TermOffsets>^TermCount
This would at least double the size of the .prx file, the largest file
in Lucene's index. Yes, it's useful, but not all folks will use it. So
not all folks should have to pay for it. One way is to try to make it
arbitrarily extensible, but to some degree, that's going to end up being
language-specific.
So perhaps instead we should simply allocate more bits in the FieldInfo.
We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX,
NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc. We can increase the number of
bits there by turning this into a VInt, which would be back-compatible, no?
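
A rough sketch of why that could work, in Java. The flag names are the ones
listed above, but the bit values here are hypothetical and are not Lucene's
actual layout. The VInt routines follow the usual scheme of 7 data bits per
byte, low-order group first, with the high bit marking continuation. Any
flags value below 0x80 then encodes to the same single byte an old reader
expects, which is the sense in which the change would be back-compatible;
it only holds as long as existing files never set the top bit of that byte.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: bit assignments are invented for illustration only.
final class FieldFlagsSketch {
    static final int OMIT_FREQ           = 0x40;  // assumed: still fits in one byte
    static final int OFFSETS_IN_PRX      = 0x80;  // assumed: first flag needing a 2nd byte
    static final int WEIGHT_PER_POSITION = 0x100; // assumed

    // Encode a non-negative int as a VInt: 7 bits per byte, low-order group
    // first, high bit set on every byte except the last.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a VInt starting at the beginning of the array.
    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // Flags below 0x80 occupy exactly one byte, same as a plain byte field.
        System.out.println(writeVInt(OMIT_FREQ | 0x01).length);
        // Flags at or above 0x80 spill into a second byte.
        System.out.println(writeVInt(OFFSETS_IN_PRX | 0x01).length);
    }
}
```

Old readers would still choke on a file that actually uses one of the new
high bits, so this keeps old files readable by new code rather than the
reverse.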
Doug