Marvin Humphrey wrote:
Moving away from cached norms was the second of three major changes to the file format on my agenda, and the one I was all but certain I wouldn't be able to sell to the Lucene community. The first was using bytecounts at the head of Strings.

The third was storing start offsets and end offsets in the ProxFile. It rankles that much of the information from tis/frq/prx gets duplicated in the term vector files, but highlighting is most efficient when you know the offsets, and the primary index stops short of storing that information. Currently, we have this:

    ProxFile (.prx) -->  <TermPositions>TermCount

How about this?

    ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

This would at least double the size of the .prx file, the largest file in Lucene's index. Yes, it's useful, but not all folks will use it, so not all folks should have to pay for it. One way to avoid that is to try to make the format arbitrarily extensible, but to some degree, that's going to end up being language-specific.
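
To put a rough number on that cost, here is a minimal Python sketch that encodes the same postings twice, once with positions only and once with positions plus offsets. It uses the VInt encoding (7 bits per byte, low-order bits first, high bit as a continuation flag). The function names, the sample data, and in particular the offset layout (start offset as a delta from the previous end offset, followed by the token length) are assumptions made for illustration, not a proposed format.

```python
def write_vint(value: int) -> bytes:
    """Encode a non-negative int as a VInt: 7 bits per byte, low-order first."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)                      # final byte has the high bit clear
    return bytes(out)


def encode_positions(positions):
    """Delta-encode term positions, as a positions-only .prx entry would."""
    out = bytearray()
    prev = 0
    for pos in positions:
        out += write_vint(pos - prev)
        prev = pos
    return bytes(out)


def encode_positions_with_offsets(entries):
    """Hypothetical layout: position delta, start-offset delta, token length."""
    out = bytearray()
    prev_pos = 0
    prev_end = 0
    for pos, start, end in entries:
        out += write_vint(pos - prev_pos)
        out += write_vint(start - prev_end)  # gap since the previous token's end
        out += write_vint(end - start)       # length of this token in characters
        prev_pos, prev_end = pos, end
    return bytes(out)


# Invented sample: one term occurring three times in a document.
ENTRIES = [(3, 15, 20), (17, 96, 101), (40, 230, 235)]

if __name__ == "__main__":
    plain = encode_positions([pos for pos, _, _ in ENTRIES])
    rich = encode_positions_with_offsets(ENTRIES)
    print(len(plain), len(rich))
```

For this sample the positions-only encoding takes 3 bytes and the version with offsets takes 10, because every occurrence gains at least two extra VInts. Real ratios depend on the text and on whatever layout is actually chosen.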

So perhaps instead we should simply allocate more bits in the FieldInfo. We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX, NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc. We can increase the number of bits there by turning this into a VInt, which would be back-compatible, no?
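
A small Python sketch of why the VInt switch can be back-compatible: a flags byte whose value is below 128 is already a valid one-byte VInt with the same value, so an old file reads unchanged, while new flags can spill into a second byte. The bit assignments below are hypothetical and chosen only for illustration, and the sketch assumes the existing flags byte never sets its high bit (0x80). If it did, that byte would be read as a VInt continuation marker and compatibility would break.

```python
import io

# Hypothetical bit assignments, for illustration only.
IS_INDEXED = 1 << 0            # assumed to be an existing low-order flag
STORE_TERM_VECTOR = 1 << 1     # assumed to be an existing low-order flag
WEIGHT_PER_POSITION = 1 << 7   # new flags start above the old byte's range
OFFSETS_IN_PRX = 1 << 8
NORMS_IN_FRQ = 1 << 9
OMIT_PRX = 1 << 10
OMIT_FREQ = 1 << 11


def write_vint(value: int) -> bytes:
    """Encode a non-negative int as a VInt: 7 bits per byte, low-order first."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(stream) -> int:
    """Decode one VInt from a binary stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated VInt")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:             # high bit clear: last byte
            return result
        shift += 7


if __name__ == "__main__":
    # An old-style single flags byte decodes to the same value as a VInt.
    old_byte = bytes([IS_INDEXED | STORE_TERM_VECTOR])
    print(read_vint(io.BytesIO(old_byte)))

    # New flags need a second byte but round-trip the same way.
    flags = IS_INDEXED | OFFSETS_IN_PRX | OMIT_FREQ
    encoded = write_vint(flags)
    print(len(encoded), read_vint(io.BytesIO(encoded)) == flags)
```

Fields that use none of the new flags still cost a single byte, so only indexes that opt in to the extra features pay for the wider flag set.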

Doug
