On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:

If I have a pluggable indexer,
then on the querying side I need something (I'm not sure what/how)
that knows how to create the right demuxer (container) and codec
(decoder) to interact with whatever my indexing plugins wrote.

So I don't think it's "out of bounds" to have *Query classes that know
to use the non-prx variant of a field when positions aren't needed.
Though I agree it makes things more complex.

Perhaps switching up files based on Query type isn't "out of bounds", but you get hit with a double whammy. First, there's the added complexity -- which gets extra nasty when you consider that we wouldn't want this optimization to be on by default. Second, you wind up with duplicated information in the index -- resulting in greater space requirements and increased indexing time.

That's an awful lot of inconvenience for the sake of an optimization. I wouldn't bother.

My initial thought was that the unified postings file format that KS is using now offers so many benefits that KS should go with it anyway, despite the increased costs for simple TermQueries. But maybe we can nail the design for the divide-by-data-type format right now.

This is like "column stride" vs "row stride" serialization
of a matrix.

Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)


Yes, I've thought about that.  It defeats the phrase-query locality
argument for unified postings files and recommends breaking things up
logically by type of data into frq/prx/payload/whatever.

Would it be possible to design a Posting plugin class that reads from
multiple files?  I'm pretty sure the answer is yes.  It messes up the
single-stream readRecord() method I've been detailing and might force
Posting to maintain state.  But if Postings are scarce TermBuffer-style objects where values mutate, rather than populous Term-style objects where you need a new instance for each set of values, then it doesn't matter if they're large.

If that could be done, I think it would be possible to retrofit the
Posting/PostingList concept into Lucene without a file format change. FWIW.

I think this is possible.

We should aggressively explore this concept. By my tally, breaking up posting content into multiple files by data type has the most benefits and the fewest drawbacks.

Brainstorming...

In devel-branch KS, ScorePost_Read_Record reads three blocks of data from an InStream:

  doc_num/freq
  boost/norm byte
  positions

Let's say that instead of ScorePost_Read_Record operating on a passed-in InStream, we give ScorePosting a ScorePost_Next() method, and have it operate on three internally maintained InStreams:

    void
    ScorePost_next(ScorePosting *self)
    {
        u32_t  num_prox;
        u32_t  position = 0;
        u32_t *positions;
        const u32_t doc_code = InStream_Read_C32(self->frq_in);
        const u32_t doc_delta = doc_code >> 1;

        /* Apply delta doc and retrieve freq. */
        self->doc_num  += doc_delta;
        if (doc_code & 1)
            self->freq = 1;
        else
            self->freq = InStream_Read_C32(self->frq_in);

        /* Decode boost/norm byte. */
        self->impact = self->sim->norm_decoder[
                InStream_Read_U8(self->norms_in)
            ];

        /* Read positions. */
        num_prox = self->freq;
        if (num_prox > self->prox_cap) {
            self->prox     = REALLOCATE(self->prox, num_prox, u32_t);
            self->prox_cap = num_prox;
        }
        positions = self->prox;
        while (num_prox--) {
            position += InStream_Read_C32(self->prx_in);
            *positions++ = position;
        }
    }

So, there we have a single operation, in a single class, defining a codec -- achieving my main goal, even while reading from multiple files.

But check this out: we could feed the ScorePosting constructor the same InStream object three times, and it wouldn't know the difference. So the same read operation can be used against both a multi-file format and a single file format.
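
A rough sketch of that wiring (ScorePost_new's three-stream signature is invented here for illustration, not the actual KS constructor):

    /* Multi-file format: one InStream per data type. */
    ScorePosting *multi_file = ScorePost_new(frq_in, prx_in, norms_in);

    /* Unified format: the very same InStream handed over three times.
     * ScorePost_next() still reads from self->frq_in, self->norms_in,
     * and self->prx_in, never knowing they all point at one file. */
    ScorePosting *unified = ScorePost_new(all_in, all_in, all_in);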

Seeking might get a little weird, I suppose.

I think a variant of that read op could be created against various versions of the Lucene file format going back, making it possible to isolate and archive obsolete codecs and clean up the container classes.

When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.

Yes.

Now's a good time to remark on a difference between KS and Lucene when reading the equivalent of TermPositions: In KS, all positions are read in one grab during ScorePost_Read_Record -- there's no nextPosition() method. That was a tradeoff I made primarily for simplicity's sake, since it meant that PhraseScorer could be implemented with integer arrays and pointer math. Reduced CPU overhead was another theoretical benefit, but I've never benchmarked it.
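
To illustrate the pointer math (a simplified sketch, not the actual KS PhraseScorer): with each term's positions for the current doc already decoded into sorted arrays, checking whether term B ever occurs immediately after term A is a simple two-pointer merge.

    /* Sketch only -- not actual KS code.  Returns 1 if some position in
     * b equals some position in a plus one, 0 otherwise.  Both arrays
     * are assumed sorted ascending, as ScorePost_next() produces them. */
    static int
    positions_adjacent(const u32_t *a, u32_t a_len,
                       const u32_t *b, u32_t b_len)
    {
        const u32_t *a_end = a + a_len;
        const u32_t *b_end = b + b_len;
        while (a < a_end && b < b_end) {
            if      (*a + 1 == *b) { return 1; }
            else if (*a + 1 <  *b) { a++; }
            else                   { b++; }
        }
        return 0;
    }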

If you didn't want to do that but you still wanted to implement a PhraseScorer based around PostingList objects rather than TermPositions objects, you have a bit of a quandary: PostingList only advances in doc-sized increments, because it doesn't have a nextPosition() method. So, nextPosition() would have to be implemented by ScorePosting:

  // Iterate over docs and positions.
  while (postingList.next()) {
    ScorePosting posting = postingList.getPosting();
    while (posting.nextPosition()) {
      int position = posting.getPosition();
      ...
    }
  }

If we did that, it would require a change in the behavior for ScorePost_Next(), above.
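
The least invasive change might be for ScorePost_Next() to keep slurping all positions for the doc but reset a cursor, with nextPosition() just walking the array. A sketch (prox_tick is an invented member, not current KS code):

    /* Sketch only.  Returns 1 and sets *position while positions remain
     * for the current doc, 0 once they are exhausted.  Assumes
     * ScorePost_next() resets self->prox_tick to 0 for each doc. */
    int
    ScorePost_Next_Position(ScorePosting *self, u32_t *position)
    {
        if (self->prox_tick >= self->freq) { return 0; }
        *position = self->prox[self->prox_tick++];
        return 1;
    }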

Thinking about a codec reading multiple files... I would also love
this: a codec that can write & read layered updates to the inverted
files.

IMO, it should not be the codec object that performs actions like this -- it should be a container class that the user knows nothing about.

I agree, Lucene needs a stronger separation of "container" from
"codec".  If someone just wants to plugin a new codec they should be
able to cleanly plug into an existing container and "only" provide the
codec.

EG, say I want to store say an arbitrary array of ints per document.
Not all documents have an array, and when they do, the array length
varies.

To do this, I'd like to re-use a container that knows how to store a
byte blob per document, sparsely, and say indexed with a multi-level
skip list.  That container is exactly the container we now use for
storing frq/prx data, under each term.  So, ideally, I can easily
re-use that container and just provide a codec (encoder & decoder)
that maps from an int[] to a byte blob, and back.

We need to factor things so that this container can be easily shared
and is entirely decoupled from the codecs it's using.

Perfect.
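
To make the codec half of that example concrete, the encoder/decoder could be as small as this (a sketch; IntBlobCodec is invented, and OutStream_Write_C32 is assumed to exist as the mirror image of InStream_Read_C32). The container keeps ownership of sparseness and skipping; the codec only maps an int array to a byte blob and back:

    /* Sketch of a codec mapping an int array to a byte blob and back,
     * for use with a shared sparse-blob-per-doc container. */
    void
    IntBlobCodec_encode(OutStream *outstream, const u32_t *ints, u32_t count)
    {
        u32_t i;
        OutStream_Write_C32(outstream, count);
        for (i = 0; i < count; i++) {
            OutStream_Write_C32(outstream, ints[i]);
        }
    }

    /* Returns the number of ints stored; fills at most max of them. */
    u32_t
    IntBlobCodec_decode(InStream *instream, u32_t *ints, u32_t max)
    {
        u32_t i;
        const u32_t count = InStream_Read_C32(instream);
        for (i = 0; i < count && i < max; i++) {
            ints[i] = InStream_Read_C32(instream);
        }
        return count;
    }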

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

