score *= normDecoder[norms[doc] & 0xFF]; // normalize for field
If we're talking NORMS_IN_FREQ, then you'd replace that line with
one call to getBoost() against the TermDocs (or maybe getNorm?
getMultiplier?).
I'll start there.
Considering I don't have to worry about any index format with the
InstantiatedIndex, it should be fairly easy to get it working.
Here's the direction I'm headed: One file, the "PostingsFile",
which merges the FreqFile, ProxFile, and Boost/Norm for each posting
into a single contiguous block, with an eye towards aggressively
minimizing disk seeks.
I've worked up a prototype which is a hybrid of the current Lucene
design and the version from the Google paper. The advantage of the
Google design is that since the postings are fixed width, it is fast
and easy to either iterate through them or skip over them. The
disadvantage of that design is that the fixed width forces truncation
of certain data -- for instance, all positions above 4095 are encoded
as 4096, which screws up phrase matching.
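To make the truncation concrete (the cap value is per the Google
paper; the variable names are mine):

    // The Google-paper truncation in one line: positions get 12 bits,
    // so anything above 4095 is encoded as the sentinel 4096, and
    // exact phrase positions deep in a long document are lost.
    int encodedPosition = Math.min(position, 4096);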
For many documents, the fixed width posting format is sufficient, but
for a minority of cases, important information can't fit. One answer
is to use flag bits to indicate all of the following (one possible
bit assignment is sketched after the list):
* Whether the Postings are fixed width or whether they
had to be encoded using a variable width technique.
* Whether positions and boosts are stored at all (for
many queries, all you need to know is that a Term is
present).
* Whether the Freq is 1 or encoded separately.
* How the DocDelta is encoded.
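As promised, one possible bit assignment for those flags; the names
and positions are purely illustrative, not the actual format:

    // Sketch of the 4 header flag bits, one per item in the list above.
    static final int FLAG_VARIABLE_WIDTH = 0x1; // posting is VInt-encoded
    static final int FLAG_HAS_POSITIONS  = 0x2; // positions/boosts stored
    static final int FLAG_FREQ_IS_ONE    = 0x4; // freq omitted, implied 1
    static final int FLAG_DELTA_BYTE     = 0x8; // DocDelta completed by a byte, not a VInt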
Fixed-width postings are two bytes wide (as with Google). Variable
width postings are encoded using VInts. Non-existent postings take
up zero bytes. :)
The header consists of 4 flag bits, 4 bits which optionally
contribute to the DocDelta, and either an additional byte or an
additional VInt to complete the DocDelta. Subsequent positions are
delta encoded, like current Lucene and apparently unlike Google
1998. The variable width posting format is required whenever the
TermDoc contains at least one ProxDelta which exceeds the maximum
ProxDelta the fixed width format can encode. At 12 bits of position
per posting, that's 4095.
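Here's a hedged decode of that header using Lucene's IndexInput; the
nibble order and the way the extra byte or VInt combines with the
header bits are my guesses, not the actual on-disk layout:

    // Illustrative decode of the posting header described above.
    // FLAG_DELTA_BYTE is the constant sketched earlier.
    static int readDocDelta(org.apache.lucene.store.IndexInput in)
            throws java.io.IOException {
        int header = in.readByte() & 0xFF;
        int flags  = header >>> 4;  // 4 flag bits
        int high   = header & 0xF;  // 4 bits contributed to the DocDelta
        if ((flags & FLAG_DELTA_BYTE) != 0) {
            // common case: one more byte completes a 12-bit DocDelta
            return (high << 8) | (in.readByte() & 0xFF);
        }
        // rare case: an additional VInt supplies the remaining bits
        return (in.readVInt() << 4) | high;
    }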
The thing I haven't quite figured out yet is how to allocate bits for
the Boost. Google uses 4-8 bits per posting for "capitalization",
"font size", and a flag indicating a "fancy" hit ie something from a
title or anchor. That leaves them 8-12 bits for position
information. If we just copy the full 8 bits of Lucene's current
byte Norm format, that only leaves 8 bits for the ProxDelta per
position. That's not enough -- we'll end up using the variable
format way too often, and it's terribly redundant.
It's not obvious to me how to distribute lengthNorm information over
several 4-bit posting slots, though. Past hard experience has taught
me that a scoring system which isn't normalized for field length
suffers from poor precision; we definitely want a score
multiplier in there, and more fine-grained than 16 levels. We do
have the position of the posting to work with, though, so we can
weight up-front postings more heavily at least.
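To illustrate that last idea (entirely my own sketch, not a settled
design): fold a position-based decay into the 4-bit slot, so the
coarse quantization at least favors up-front postings. The decay
constant is arbitrary:

    // Hypothetical 4-bit boost encoder, nothing more than a sketch:
    // quantize the lengthNorm and damp it by position so up-front
    // postings weigh more. The decay constant 100 is a placeholder.
    static int encodeBoost(float lengthNorm, int position) {
        float positionWeight = 100.0f / (100.0f + position);
        int slot = Math.round(lengthNorm * positionWeight * 15.0f);
        return Math.max(0, Math.min(15, slot)); // clamp to 4 bits
    }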
A test index using the Reuters corpus and WhiteSpaceAnalyzer went
from 8 MB to 11 MB under this system. I haven't yet eliminated the
norms, but they only take up 40k or so at present.
I found it surprisingly easy to make these changes to KinoSearch; the
only two classes that had to be modified were SegTermDocs and
PostingsWriter. Maybe experimenting with analogous changes to Lucene
will be just as easy.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/