score *= normDecoder[norms[doc] & 0xFF]; // normalize for field

If we're talking NORMS_IN_FREQ, then you'd replace that line with a single call to getBoost() on the TermDocs (or maybe getNorm? getMultiplier?).
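
Concretely, the scoring line might come out something like this (pure sketch; getBoost() is just the placeholder name floated above, and the call is hypothetical):

    // Hypothetical NORMS_IN_FREQ version: the field normalization
    // factor travels with the posting, so there's no norms[] lookup.
    score *= termDocs.getBoost(); // normalize for field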

I'll start there.

Considering that I don't have to worry about any index file format with the InstantiatedIndex, it should be fairly easy to get it working.

Here's the direction I'm headed: One file, the "PostingsFile", which merges the FreqFile, ProxFile, and Boost/Norm for each posting into a single contiguous block, with an eye towards aggressively minimizing disk seeks.
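
In the grammar style of Lucene's file-format docs, the rough shape I have in mind is something like this (my sketch; the element names and details aren't settled):

    TermPostings --> <TermDoc>+
    TermDoc      --> Header, Freq?, <Posting>+
    Posting      --> Boost?, ProxDelta      (fixed or variable width)

Everything needed to score a term against a document then sits in one contiguous run of bytes, rather than being split across the freq and prox files plus a norms array.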

I've worked up a prototype which is a hybrid of the current Lucene design and the version from the Google paper. The advantage of the Google design is that since the postings are fixed width, it is fast and easy to either iterate through them or skip over them. The disadvantage of that design is that the fixed width forces truncation of certain data -- for instance, all positions above 4096 are encoded as 4096, which screws up phrase matching.

For many documents, the fixed width posting format is sufficient, but for a minority of cases, important information can't fit. One answer is to use flag bits to indicate all of the following (see the sketch just after this list):

  * Whether the Postings are fixed width or whether they
    had to be encoded using a variable width technique.
  * Whether positions and boosts are stored at all (for
    many queries, all you need to know is that a Term is
    present).
  * Whether the Freq is 1 or encoded separately.
  * How the DocDelta is encoded.
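
As a sketch, those four flags might be carved out of the header's flags nibble like so (the names and bit assignments are mine, purely illustrative):

    // Hypothetical flag bits occupying the header's 4-bit flags nibble.
    static final int FLAG_VARIABLE_WIDTH = 0x1; // postings encoded as VInts
    static final int FLAG_HAS_POSITIONS  = 0x2; // positions/boosts are stored
    static final int FLAG_FREQ_FOLLOWS   = 0x4; // Freq != 1, encoded separately
    static final int FLAG_DOC_DELTA_VINT = 0x8; // DocDelta completed by a VInt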

Fixed width postings are two bytes wide (as with Google). Variable width postings are encoded using VInts. Non-existent postings take up zero bytes. :)

The header consists of 4 flag bits, 4 bits which optionally contribute to the DocDelta, and either an additional byte or an additional VInt to complete the DocDelta. Subsequent positions are delta encoded, like current Lucene and apparently unlike Google 1998. The variable width posting format is required whenever the TermDoc contains at least one ProxDelta which exceeds the maximum ProxDelta the fixed width format can encode. At 12 bits per posting for position, that's 4095.
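
To make that concrete, here's roughly how I picture the writer side, using Lucene's IndexOutput (writeHeader, writeFixedPosting, and the exact bit placement are my own assumptions; nothing here is final):

    // Header: high nibble = 4 flag bits, low nibble = top 4 bits of the
    // DocDelta; the rest of the DocDelta follows as one byte or one VInt.
    void writeHeader(IndexOutput out, int flags, int docDelta)
        throws IOException {
      if ((flags & FLAG_DOC_DELTA_VINT) == 0) {
        // 4 nibble bits + 8 byte bits = 12 bits of DocDelta
        out.writeByte((byte) ((flags << 4) | (docDelta >>> 8)));
        out.writeByte((byte) docDelta);
      } else {
        out.writeByte((byte) ((flags << 4) | (docDelta & 0xF)));
        out.writeVInt(docDelta >>> 4);
      }
    }

    // Fixed-width posting: 4 bits of Boost + 12 bits of ProxDelta = 2 bytes.
    // Any ProxDelta over 4095 forces the variable width format instead.
    void writeFixedPosting(IndexOutput out, int boost, int proxDelta)
        throws IOException {
      assert boost <= 0xF && proxDelta <= 0xFFF;
      int posting = (boost << 12) | proxDelta;
      out.writeByte((byte) (posting >>> 8));
      out.writeByte((byte) posting);
    }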

The thing I haven't quite figured out yet is how to allocate bits for the Boost. Google uses 4-8 bits per posting for "capitalization", "font size", and a flag indicating a "fancy" hit, i.e., something from a title or anchor. That leaves them 8-12 bits for position information. If we just copy the full 8 bits of Lucene's current byte Norm format, that only leaves 8 bits per position for the ProxDelta. That's not enough -- we'll end up falling back to the variable width format far too often, and repeating the same Norm in every posting is terribly redundant.

It's not obvious to me how to distribute lengthNorm information over several 4-bit posting slots, though. Past hard experience has taught me that a scoring system which isn't normalized for field length suffers from poor precision; we definitely want a score multiplier in there, and one more fine-grained than 16 levels. We do have the position of the posting to work with, so at least we can weight up-front postings more heavily.
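
To show the shape of one half-formed idea: a position-decayed, 4-bit quantization of the lengthNorm might look like this (postingBoost, the decay curve, and the constants are entirely invented):

    // Hypothetical: fold lengthNorm into each posting's 4-bit Boost slot,
    // decaying with position so up-front postings count for more.
    int postingBoost(float lengthNorm, int position) {
      float positionWeight = 1.0f / (1.0f + position / 100.0f); // invented decay
      int boost = Math.round(lengthNorm * positionWeight * 0xF); // quantize to 4 bits
      return Math.min(boost, 0xF);
    }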

A test index using the Reuters corpus and WhitespaceAnalyzer went from 8 MB to 11 MB under this system. I haven't yet eliminated the norms, but they only take up 40 KB or so at present.

I found it surprisingly easy to make these changes to KinoSearch; the only two classes that had to be modified were SegTermDocs and PostingsWriter. Maybe experimenting with analogous changes to Lucene will be just as easy.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

