On Oct 19, 2007, at 3:53 PM, Michael Busch wrote:

The next question would be how to store the per-doc payloads (PDP). If
all values have the same length (as the unique docIds), then we should
store them as efficiently as possible, like the norms. However, we still
want to offer the flexibility of having variable-length values. For this
case we could use a new data structure similar to our posting list.

PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
FixedLengthPDPList    --> <Payload>^SegSize
VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
Payload               --> Byte^PayloadLength
PayloadLength         --> VInt
SkipList              --> see frq.file
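For the fixed-length case, lookup reduces to offset arithmetic, much as with the norms. The following is only a sketch under assumptions: the class and method names are hypothetical, and the whole FixedLengthPDPList is assumed to be already loaded into a byte array.

```java
import java.util.Arrays;

public class FixedLengthPdpSketch {
    // Hypothetical helper: every doc owns exactly payloadLength bytes,
    // so the payload for docId starts at docId * payloadLength.
    public static byte[] payloadFor(byte[] pdpList, int payloadLength, int docId) {
        int start = docId * payloadLength;
        return Arrays.copyOfRange(pdpList, start, start + payloadLength);
    }
}
```

No skip list is needed here because the position of each payload follows directly from the doc number.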

There's another approach, which has the following advantages:

  * Simpler.
  * Pluggable.
  * More future-proof.
  * More closely models IR Theory.
  * Easier for other implementations to deal with.
  * Breaks the tight binding between Lucene and its file format.

Start with a Posting base class.

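  import java.io.IOException;
  import org.apache.lucene.store.IndexInput;
  import org.apache.lucene.store.IndexOutput;

  public class Posting {
    private int docNum = 0;      // current doc number, rebuilt from deltas
    private int lastDocNum = 0;  // doc number most recently written

    public int getDocNum() { return docNum; }

    // Doc numbers are stored as VInt deltas, so accumulate on read.
    public void read(IndexInput inStream) throws IOException {
      docNum += inStream.readVInt();
    }

    // Write the delta from the previous posting, then remember this one.
    public void write(IndexOutput outStream) throws IOException {
      outStream.writeVInt(docNum - lastDocNum);
      lastDocNum = docNum;
    }
  }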

Then, PostingList (subclassed by SegPostingList and MultiPostingList, naturally).

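  import java.io.IOException;

  public abstract class PostingList {
    // The Posting the list is currently positioned on.
    public abstract Posting getPosting();
    // Advance to the next Posting; false when the list is exhausted.
    public abstract boolean next() throws IOException;
    // Advance to a Posting for target or the next doc after it.
    public abstract boolean skipTo(int target) throws IOException;
  }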
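To illustrate the next()/skipTo() contract, here is a minimal in-memory sketch. The class ArrayPostingList and the helper firstAtOrAfter are hypothetical, and the skipTo semantics (stop at the first doc number greater than or equal to the target) are assumed to match TermDocs.skipTo. It scans linearly; a real SegPostingList would presumably consult skip data instead.

```java
public class SkipToSketch {

    // Hypothetical stand-in for a PostingList over sorted doc numbers.
    static class ArrayPostingList {
        private final int[] docs;
        private int pos = -1;  // positioned before the first posting

        ArrayPostingList(int[] sortedDocs) { this.docs = sortedDocs; }

        int getDocNum() { return docs[pos]; }

        // Advance one posting; false once the list is exhausted.
        boolean next() { return ++pos < docs.length; }

        // Linear fallback: call next() until a doc >= target is reached.
        boolean skipTo(int target) {
            while (next()) {
                if (getDocNum() >= target) return true;
            }
            return false;
        }
    }

    // Returns the first doc number >= target, or -1 if there is none.
    public static int firstAtOrAfter(int[] sortedDocs, int target) {
        ArrayPostingList list = new ArrayPostingList(sortedDocs);
        return list.skipTo(target) ? list.getDocNum() : -1;
    }
}
```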

Each field gets its own "postings" file within the segment, named _SEGNUM_FIELDNUM.p, where SEGNUM and FIELDNUM are encoded using base 36. Each of these files is a solid stack of serialized Postings.
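The naming scheme can be sketched as follows. The helper name postingsFileName is hypothetical; the only thing taken from the description above is the _SEGNUM_FIELDNUM.p pattern with both numbers in base 36.

```java
public class PostingsFileName {
    // Hypothetical helper: builds "_SEGNUM_FIELDNUM.p", encoding both
    // numbers in base 36 (Character.MAX_RADIX is 36).
    public static String postingsFileName(int segNum, int fieldNum) {
        return "_" + Integer.toString(segNum, Character.MAX_RADIX)
             + "_" + Integer.toString(fieldNum, Character.MAX_RADIX)
             + ".p";
    }
}
```

For example, segment 36 and field 2 would map to _10_2.p.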

Posting subclasses like ScorePosting, PayloadPosting, etc., implement their own read() and write() methods. Thus, Posting subclasses wholly define their own file format -- instead of the current, brittle design, where read/write code is dispersed over multiple classes. If some Posting types become obsolete, they can be deprecated, but PostingList and its subclasses won't require the addition of crufty special-case code to stay back-compatible.
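A rough sketch of a subclass that owns its own format might look like this. Everything here is an assumption made for illustration: java.io data streams stand in for IndexInput/IndexOutput so the example has no dependencies, fixed-width ints replace VInts for brevity, and the payload layout (length prefix, then bytes) and the setDocNum and roundTrip helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PayloadPostingSketch {

    // Stand-in for the Posting base class described above.
    static class Posting {
        protected int docNum = 0;
        protected int lastDocNum = 0;

        void setDocNum(int docNum) { this.docNum = docNum; }
        int getDocNum() { return docNum; }

        void read(DataInputStream in) throws IOException {
            docNum += in.readInt();             // accumulate the stored delta
        }

        void write(DataOutputStream out) throws IOException {
            out.writeInt(docNum - lastDocNum);  // delta from the previous posting
            lastDocNum = docNum;
        }
    }

    // Hypothetical subclass: a length-prefixed payload follows the doc delta.
    static class PayloadPosting extends Posting {
        byte[] payload = new byte[0];

        @Override
        void read(DataInputStream in) throws IOException {
            super.read(in);
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }

        @Override
        void write(DataOutputStream out) throws IOException {
            super.write(out);
            out.writeInt(payload.length);
            out.write(payload);
        }
    }

    // Serializes postings for docs 3 and 7, then reads them back
    // with a single reused PayloadPosting and returns the doc numbers seen.
    public static int[] roundTrip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            PayloadPosting writer = new PayloadPosting();
            int[] docs = {3, 7};
            for (int doc : docs) {
                writer.setDocNum(doc);
                writer.payload = new byte[] {(byte) doc};
                writer.write(out);
            }
            out.flush();

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            PayloadPosting reader = new PayloadPosting();
            int[] seen = new int[docs.length];
            for (int i = 0; i < seen.length; i++) {
                reader.read(in);
                seen[i] = reader.getDocNum();
            }
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the design is that the code driving the loop never needs to know what a PayloadPosting stores; it only calls read() and write().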

There's more (I've written a working implementation), but that's the gist.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


