Re: Rich positions (was "boosting fields")

Doug Cutting Thu, 27 Apr 2006 12:18:20 -0700

Marvin Humphrey wrote:

Moving away from cached norms was the second of three major changes to the file format on my agenda, and the one I was all but certain I wouldn't be able to sell to the Lucene community. The first was using bytecounts at the head of Strings.
The third was storing start offsets and end offsets in the ProxFile. It rankles that much of the information from tis/frq/prx gets duplicated in the term vector files, but highlighting is most efficient when you know the offsets, and the primary index stops short of storing that information. Currently, we have this:
    ProxFile (.prx) -->  <TermPositions>TermCount

How about this?

    ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

This would at least double the size of the .prx file, the largest file in Lucene's index. Yes, it's useful, but not all folks will use it. So not all folks should have to pay for it. One way is to try to make it arbitrarily extensible, but to some degree, that's going to end up being language-specific.
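
To make the size concern concrete, here is a rough sketch that counts the bytes needed for one term's entries in one document, with and without offsets. It assumes positions are written as VInt deltas and that each offset pair would be written as a VInt start delta plus a VInt length; that offset encoding, the function names, and the sample numbers are all hypothetical and not part of any proposal in this thread.

```python
def vint_len(value):
    """Number of bytes a non-negative int occupies as a VInt (7 payload bits per byte)."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def prx_bytes(positions, offsets=None):
    """Bytes needed for one term's entries in one document.

    positions: ascending token positions.
    offsets:   optional list of (start, end) character offsets, one per
               position. Assumed encoding (hypothetical): start as a delta
               from the previous start, then the length end - start.
    """
    total = 0
    prev_pos = 0
    prev_start = 0
    for i, pos in enumerate(positions):
        total += vint_len(pos - prev_pos)          # position delta
        prev_pos = pos
        if offsets is not None:
            start, end = offsets[i]
            total += vint_len(start - prev_start)  # start offset delta
            total += vint_len(end - start)         # token length
            prev_start = start
    return total


# Made-up sample data: three occurrences of a six-character term.
positions = [3, 17, 42]
offsets = [(15, 21), (98, 104), (250, 256)]

print(prx_bytes(positions))           # positions only
print(prx_bytes(positions, offsets))  # positions plus offsets
```

Under these assumptions every position carries two extra VInts, and character-offset deltas tend to be larger than position deltas, so the growth is unlikely to be less than a doubling.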

So perhaps instead we should simply allocate more bits in the FieldInfo. We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX, NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc. We can increase the number of bits there by turning this into a VInt, which would be back-compatible, no?
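
A minimal sketch of why the VInt change could be back-compatible, assuming the existing flags byte never has its high bit set: any value below 128 encodes as the same single byte in VInt form, so old files would read back unchanged. The flag names come from the list above, but the bit positions chosen here are purely hypothetical, and the helper functions are illustrative rather than Lucene's actual code.

```python
# Hypothetical bit assignments, placed above the low seven bits so they
# cannot collide with flags that fit in today's single byte.
WEIGHT_PER_POSITION = 1 << 7
OFFSETS_IN_PRX = 1 << 8
NORMS_IN_FRQ = 1 << 9
OMIT_PRX = 1 << 10
OMIT_FREQ = 1 << 11


def write_vint(value):
    """Encode a non-negative int as a VInt: 7 bits per byte, low-order
    group first, high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(data, pos=0):
    """Decode one VInt starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return result, pos
        shift += 7


# An old-style flags byte (any value below 0x80) is already a valid VInt.
old_flags = bytes([0x13])
value, _ = read_vint(old_flags)
print(value == 0x13)

# New flags spill into a second byte only when one of them is set.
new_flags = 0x01 | OFFSETS_IN_PRX
print(write_vint(new_flags).hex())
```

The caveat is the high bit: a legacy byte of 0x80 or above would be misread as the first byte of a multi-byte VInt, so the compatibility argument holds only while existing flag values stay below 128.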


Doug
