On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:

If I have a pluggable indexer,
then on the querying side I need something (I'm not sure what/how)
that knows how to create the right demuxer (container) and codec
(decoder) to interact with whatever my indexing plugins wrote.

So I don't think it's "out of bounds" to have *Query classes that know
to use the non-prx variant of a field when positions aren't needed.
Though I agree it makes things more complex.

Perhaps switching up files based on Query type isn't "out of bounds", but you get hit with a double whammy. First, there's the added complexity -- which gets extra nasty when you consider that we wouldn't want this optimization to be on by default. Second, you wind up with duplicated information in the index -- resulting in greater space requirements and increased indexing time.

That's an awful lot of inconvenience for the sake of an optimization. I wouldn't bother.

My initial thought was that the unified postings file format that KS is using now offers so many benefits that KS should go with it anyway, despite the increased costs for simple TermQueries. But maybe we can nail the design for the divide-by-data-type format right now.

This is like "column stride" vs "row stride" serialization
of a matrix.

Relatively soon, though, we will all be on SSDs, so maybe this
locality argument becomes far less important ;)


Yes, I've thought about that.  It defeats the phrase-query locality
argument for unified postings files and recommends breaking things up
logically by type of data into frq/prx/payload/whatever.

Would it be possible to design a Posting plugin class that reads from
multiple files?  I'm pretty sure the answer is yes.  It messes up the
single-stream readRecord() method I've been detailing and might force
Posting to maintain state.  But if Postings are scarce TermBuffer-style objects where values mutate, rather than populous Term-style objects where you need a new instance for each set of values, then it doesn't matter if they're large.

If that could be done, I think it would be possible to retrofit the
Posting/PostingList concept into Lucene without a file format change. FWIW.

I think this is possible.

We should aggressively explore this concept. By my tally, breaking up posting content into multiple files by data type has the most benefits and the fewest drawbacks.

Brainstorming...

In devel-branch KS, ScorePost_Read_Record reads three blocks of data from an InStream:

  doc_num/freq
  boost/norm byte
  positions

Let's say that instead of ScorePost_Read_Record operating on a passed-in InStream, we give ScorePosting a ScorePost_Next() method, and have it operate on three internally maintained InStreams:

    void
    ScorePost_next(ScorePosting *self)
    {
        u32_t  num_prox;
        u32_t  position = 0;
        u32_t *positions;
        const u32_t doc_code = InStream_Read_C32(self->frq_in);
        const u32_t doc_delta = doc_code >> 1;

        /* Apply delta doc and retrieve freq. */
        self->doc_num  += doc_delta;
        if (doc_code & 1)
            self->freq = 1;
        else
            self->freq = InStream_Read_C32(self->frq_in);

        /* Decode boost/norm byte. */
        self->impact = self->sim->norm_decoder[
                InStream_Read_U8(self->norms_in)
            ];

        /* Read positions. */
        num_prox = self->freq;
        if (num_prox > self->prox_cap) {
            self->prox     = REALLOCATE(self->prox, num_prox, u32_t);
            self->prox_cap = num_prox;
        }
        positions = self->prox;
        while (num_prox--) {
            position += InStream_Read_C32(self->prx_in);
            *positions++ = position;
        }
    }

So, there we have a single operation, in a single class, defining a codec -- achieving my main goal, even while reading from multiple files.

But check this out: we could feed the ScorePosting constructor the same InStream object three times, and it wouldn't know the difference. So the same read operation can be used against both a multi-file format and a single file format.
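
A rough sketch of that wiring (ScorePost_new's three-stream signature is invented here for illustration, not the actual KS constructor):

    /* Multi-file format: one InStream per data type. */
    ScorePosting *multi_file = ScorePost_new(frq_in, prx_in, norms_in);

    /* Unified format: the very same InStream handed over three times.
     * ScorePost_next() still reads from self->frq_in, self->norms_in,
     * and self->prx_in, never knowing they all point at one file. */
    ScorePosting *unified = ScorePost_new(all_in, all_in, all_in);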

Seeking might get a little weird, I suppose.

I think a variant of that read op could be created against various versions of the Lucene file format going back, making it possible to isolate and archive obsolete codecs and clean up the container classes.

When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.

Yes.

Now's a good time to remark on a difference between KS and Lucene when reading the equivalent of TermPositions: In KS, all positions are read in one grab during ScorePost_Read_Record -- there's no nextPosition() method. That was a tradeoff I made primarily for simplicity's sake, since it meant that PhraseScorer could be implemented with integer arrays and pointer math. Reduced CPU overhead was another theoretical benefit, but I've never benchmarked it.
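
To illustrate the pointer math (a simplified sketch, not the actual KS PhraseScorer): with each term's positions for the current doc already decoded into sorted arrays, checking whether term B ever occurs immediately after term A is a simple two-pointer merge.

    /* Sketch only -- not actual KS code.  Returns 1 if some position in
     * b equals some position in a plus one, 0 otherwise.  Both arrays
     * are assumed sorted ascending, as ScorePost_next() produces them. */
    static int
    positions_adjacent(const u32_t *a, u32_t a_len,
                       const u32_t *b, u32_t b_len)
    {
        const u32_t *a_end = a + a_len;
        const u32_t *b_end = b + b_len;
        while (a < a_end && b < b_end) {
            if      (*a + 1 == *b) { return 1; }
            else if (*a + 1 <  *b) { a++; }
            else                   { b++; }
        }
        return 0;
    }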

If you didn't want to do that but you still wanted to implement a PhraseScorer based around PostingList objects rather than TermPositions objects, you have a bit of a quandary: PostingList only advances in doc-sized increments, because it doesn't have a nextPosition() method. So, nextPosition() would have to be implemented by ScorePosting:

  // Iterate over docs and positions.
  while (postingList.next()) {
    ScorePosting posting = postingList.getPosting();
    while (posting.nextPosition()) {
      int position = posting.getPosition();
      ...
    }
  }

If we did that, it would require a change in the behavior for ScorePost_Next(), above.
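
The least invasive change might be for ScorePost_Next() to keep slurping all positions for the doc but reset a cursor, with nextPosition() just walking the array. A sketch (prox_tick is an invented member, not current KS code):

    /* Sketch only.  Returns 1 and sets *position while positions remain
     * for the current doc, 0 once they are exhausted.  Assumes
     * ScorePost_next() resets self->prox_tick to 0 for each doc. */
    int
    ScorePost_Next_Position(ScorePosting *self, u32_t *position)
    {
        if (self->prox_tick >= self->freq) { return 0; }
        *position = self->prox[self->prox_tick++];
        return 1;
    }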

Thinking about a codec reading multiple files... I would also love
this: a codec that can write & read layered updates to the inverted
files.

IMO, it should not be the codec object that performs actions like this -- it should be a container class that the user knows nothing about.

I agree, Lucene needs a stronger separation of "container" from
"codec".  If someone just wants to plugin a new codec they should be
able to cleanly plug into an existing container and "only" provide the
codec.

EG, say I want to store say an arbitrary array of ints per document.
Not all documents have an array, and when they do, the array length
varies.

To do this, I'd like to re-use a container that knows how to store a
byte blob per document, sparsely, and say indexed with a multi-level
skip list.  That container is exactly the container we now use for
storing frq/prx data, under each term.  So, ideally, I can easily
re-use that container and just provide a codec (encoder & decoder)
that maps from an int[] to a byte blob, and back.

We need to factor things so that this container can be easily shared
and is entirely decoupled from the codecs it's using.

Perfect.
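
To make the codec half of that example concrete, the encoder/decoder could be as small as this (a sketch; IntBlobCodec is invented, and OutStream_Write_C32 is assumed to exist as the mirror image of InStream_Read_C32). The container keeps ownership of sparseness and skipping; the codec only maps an int array to a byte blob and back:

    /* Sketch of a codec mapping an int array to a byte blob and back,
     * for use with a shared sparse-blob-per-doc container. */
    void
    IntBlobCodec_encode(OutStream *outstream, const u32_t *ints, u32_t count)
    {
        u32_t i;
        OutStream_Write_C32(outstream, count);
        for (i = 0; i < count; i++) {
            OutStream_Write_C32(outstream, ints[i]);
        }
    }

    /* Returns the number of ints stored; fills at most max of them. */
    u32_t
    IntBlobCodec_decode(InStream *instream, u32_t *ints, u32_t max)
    {
        u32_t i;
        const u32_t count = InStream_Read_C32(instream);
        for (i = 0; i < count && i < max; i++) {
            ints[i] = InStream_Read_C32(instream);
        }
        return count;
    }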

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

