score *= normDecoder[norms[doc] & 0xFF]; // normalize for field

If we're talking NORMS_IN_FREQ, then you'd replace that line with a single call to getBoost() on the TermDocs (or maybe getNorm? getMultiplier?).
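
Concretely, the scoring line might come out something like this (pure sketch; getBoost() is just the placeholder name floated above, and the call is hypothetical):

    // Hypothetical NORMS_IN_FREQ version: the field normalization
    // factor travels with the posting, so there's no norms[] lookup.
    score *= termDocs.getBoost(); // normalize for field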

I'll start there.

Considering that I don't have to worry about any index file format with the InstantiatedIndex, it should be fairly easy to get it working.

Here's the direction I'm headed: One file, the "PostingsFile", which merges the FreqFile, ProxFile, and Boost/Norm for each posting into a single contiguous block, with an eye towards aggressively minimizing disk seeks.
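
In the grammar style of Lucene's file-format docs, the rough shape I have in mind is something like this (my sketch; the element names and details aren't settled):

    TermPostings --> <TermDoc>+
    TermDoc      --> Header, Freq?, <Posting>+
    Posting      --> Boost?, ProxDelta      (fixed or variable width)

Everything needed to score a term against a document then sits in one contiguous run of bytes, rather than being split across the freq and prox files plus a norms array.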

I've worked up a prototype which is a hybrid of the current Lucene design and the version from the Google paper. The advantage of the Google design is that since the postings are fixed width, it is fast and easy to either iterate through them or skip over them. The disadvantage of that design is that the fixed width forces truncation of certain data -- for instance, all positions above 4096 are encoded as 4096, which screws up phrase matching.

For many documents, the fixed width posting format is sufficient, but for a minority of cases, important information can't fit. One answer is to use flag bits to indicate all of the following (see the sketch just after this list):

  * Whether the Postings are fixed width or whether they
    had to be encoded using a variable width technique.
  * Whether positions and boosts are stored at all (for
    many queries, all you need to know is that a Term is
    present).
  * Whether the Freq is 1 or encoded separately.
  * How the DocDelta is encoded.
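
As a sketch, those four flags might be carved out of the header's flags nibble like so (the names and bit assignments are mine, purely illustrative):

    // Hypothetical flag bits occupying the header's 4-bit flags nibble.
    static final int FLAG_VARIABLE_WIDTH = 0x1; // postings encoded as VInts
    static final int FLAG_HAS_POSITIONS  = 0x2; // positions/boosts are stored
    static final int FLAG_FREQ_FOLLOWS   = 0x4; // Freq != 1, encoded separately
    static final int FLAG_DOC_DELTA_VINT = 0x8; // DocDelta completed by a VInt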

Fixed width postings are two bytes wide (as with Google). Variable width postings are encoded using VInts. Non-existent postings take up zero bytes. :)

The header consists of 4 flag bits, 4 bits which optionally contribute to the DocDelta, and either an additional byte or an additional VInt to complete the DocDelta. Subsequent positions are delta encoded, like current Lucene and apparently unlike Google 1998. The variable width posting format is required whenever the TermDoc contains at least one ProxDelta which exceeds the maximum ProxDelta the fixed width format can encode. At 12 bits per posting for position, that's 4095.
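
To make that concrete, here's roughly how I picture the writer side, using Lucene's IndexOutput (writeHeader, writeFixedPosting, and the exact bit placement are my own assumptions; nothing here is final):

    // Header: high nibble = 4 flag bits, low nibble = top 4 bits of the
    // DocDelta; the rest of the DocDelta follows as one byte or one VInt.
    void writeHeader(IndexOutput out, int flags, int docDelta)
        throws IOException {
      if ((flags & FLAG_DOC_DELTA_VINT) == 0) {
        // 4 nibble bits + 8 byte bits = 12 bits of DocDelta
        out.writeByte((byte) ((flags << 4) | (docDelta >>> 8)));
        out.writeByte((byte) docDelta);
      } else {
        out.writeByte((byte) ((flags << 4) | (docDelta & 0xF)));
        out.writeVInt(docDelta >>> 4);
      }
    }

    // Fixed-width posting: 4 bits of Boost + 12 bits of ProxDelta = 2 bytes.
    // Any ProxDelta over 4095 forces the variable width format instead.
    void writeFixedPosting(IndexOutput out, int boost, int proxDelta)
        throws IOException {
      assert boost <= 0xF && proxDelta <= 0xFFF;
      int posting = (boost << 12) | proxDelta;
      out.writeByte((byte) (posting >>> 8));
      out.writeByte((byte) posting);
    }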

The thing I haven't quite figured out yet is how to allocate bits for the Boost. Google uses 4-8 bits per posting for "capitalization", "font size", and a flag indicating a "fancy" hit, i.e., something from a title or anchor. That leaves them 8-12 bits for position information. If we just copy the full 8 bits of Lucene's current byte Norm format, that only leaves 8 bits per position for the ProxDelta. That's not enough -- we'll end up falling back to the variable width format far too often, and repeating the same Norm in every posting is terribly redundant.

It's not obvious to me how to distribute lengthNorm information over several 4-bit posting slots, though. Past hard experience has taught me that a scoring system which isn't normalized for field length suffers from poor precision; we definitely want a score multiplier in there, and one more fine-grained than 16 levels. We do have the position of the posting to work with, so at least we can weight up-front postings more heavily.
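
To show the shape of one half-formed idea: a position-decayed, 4-bit quantization of the lengthNorm might look like this (postingBoost, the decay curve, and the constants are entirely invented):

    // Hypothetical: fold lengthNorm into each posting's 4-bit Boost slot,
    // decaying with position so up-front postings count for more.
    int postingBoost(float lengthNorm, int position) {
      float positionWeight = 1.0f / (1.0f + position / 100.0f); // invented decay
      int boost = Math.round(lengthNorm * positionWeight * 0xF); // quantize to 4 bits
      return Math.min(boost, 0xF);
    }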

A test index using the Reuters corpus and WhitespaceAnalyzer went from 8 MB to 11 MB under this system. I haven't yet eliminated the norms, but they only take up 40 KB or so at present.

I found it surprisingly easy to make these changes to KinoSearch; the only two classes that had to be modified were SegTermDocs and PostingsWriter. Maybe experimenting with analogous changes to Lucene will be just as easy.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

