Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
>  On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote:
>
>
> > If I have a pluggable indexer,
> > then on the querying side I need something (I'm not sure what/how)
> > that knows how to create the right demuxer (container) and codec
> > (decoder) to interact with whatever my indexing plugins wrote.
> >
> > So I don't think it's "out of bounds" to have *Query classes that know
> > to use the non-prx variant of a field when positions aren't needed.
> > Though I agree it makes things more complex.
> >
>
> Perhaps switching up files based on Query type isn't "out of bounds", but
> you get hit with a double whammy. First, there's the added complexity --
> which gets extra nasty when you consider that we wouldn't want this
> optimization to be on by default.  Second, you wind up with duplicated
> information in the index -- resulting in greater space requirements and
> increased indexing time.
>
> That's an awful lot of inconvenience for the sake of an optimization. I
> wouldn't bother.
>
> My initial thought was that the unified postings file format that KS is
> using now offers so many benefits that KS should go with it anyway, despite
> the increased costs for simple TermQueries.   But maybe we can nail the
> design for the divide-by-data-type format right now.

I think it depends on the performance difference.  At least, we should
not preclude this approach in our design, even if we don't make it
easy to do "out of the box".
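
As a strawman, the hook could be as simple as letting each Query state
whether it needs positions, and routing to a different enum (every
name below is made up for illustration -- this is not existing Lucene
or KS API):

    // Hypothetical sketch: queries ask for the cheapest postings
    // variant that satisfies them.
    interface DocFreqEnum { boolean next(); int doc(); int freq(); }
    interface DocPosEnum extends DocFreqEnum { int nextPosition(); }

    interface PostingsSource {
        DocFreqEnum docFreqEnum(String field, String term); // frq only
        DocPosEnum  docPosEnum(String field, String term);  // frq + prx
    }

    final class ScorerEnums {
        // A TermQuery never consumes positions, so it can request the
        // cheaper freq-only enum; a PhraseQuery asks for positions.
        static DocFreqEnum forQuery(PostingsSource src, String field,
                                    String term, boolean needsPositions) {
            return needsPositions ? src.docPosEnum(field, term)
                                  : src.docFreqEnum(field, term);
        }
    }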

> > > > This is like "column stride" vs "row stride" serialization
> > > > of a matrix.
> > > >
> > > > Relatively soon, though, we will all be on SSDs, so maybe this
> > > > locality argument becomes far less important ;)
> > >
> > > Yes, I've thought about that.  It defeats the phrase-query locality
> > > argument for unified postings files and recommends breaking things up
> > > logically by type of data into frq/prx/payload/whatever.
> > >
> > > Would it be possible to design a Posting plugin class that reads from
> > > multiple files?  I'm pretty sure the answer is yes.  It messes up the
> > > single-stream readRecord() method I've been detailing and might force
> > > Posting to maintain state.  But if Postings are scarce TermBuffer-style
> > > objects where values mutate, rather than populous Term-style objects where
> > > you need a new instance for each set of values, then it doesn't matter if
> > > they're large.
> > >
> > > If that could be done, I think it would be possible to retrofit the
> > > Posting/PostingList concept into Lucene without a file format change.  
> > > FWIW.
> >
> > I think this is possible.
>
> We should aggressively explore this concept.  By my tally, breaking up
> posting content into multiple files by data type has the most benefits and
> the fewest drawbacks.
>
>  Brainstorming...
>
> In devel branch KS, ScorePost_Read_Record reads three blocks of data from
> an InStream:
>
>   doc_num/freq
>   boost/norm byte
>   positions
>
> Let's say that instead of ScorePost_Read_Record operating on a passed-in
> InStream, we give ScorePosting a ScorePost_Next() method, and have it
> operate on three internally maintained InStreams:
>
>     void
>     ScorePost_next(ScorePosting *self)
>     {
>         u32_t  num_prox;
>         u32_t  position = 0;
>         u32_t *positions;
>         const u32_t doc_code = InStream_Read_C32(self->frq_in);
>         const u32_t doc_delta = doc_code >> 1;
>
>         /* Apply delta doc and retrieve freq. */
>         self->doc_num  += doc_delta;
>         if (doc_code & 1)
>             self->freq = 1;
>         else
>             self->freq = InStream_Read_C32(self->frq_in);
>
>         /* Decode boost/norm byte. */
>         self->impact = self->sim->norm_decoder[
>                 InStream_Read_U8(self->norms_in)
>             ];
>
>         /* Read positions. */
>         num_prox = self->freq;
>         if (num_prox > self->prox_cap) {
>             self->prox     = REALLOCATE(self->prox, num_prox, u32_t);
>             self->prox_cap = num_prox; /* remember the new capacity */
>         }
>         positions = self->prox;
>         while (num_prox--) {
>             position += InStream_Read_C32(self->prx_in);
>             *positions++ = position;
>         }
>     }
>
> So, there we have a single operation, in a single class, defining a codec
> -- achieving my main goal, even while reading from multiple files.
>
> But check this out: we could feed the ScorePosting constructor the same
> InStream object three times, and it wouldn't know the difference.  So the
> same read operation can be used against both a multi-file format and a
> single file format.
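
In other words, the codec holds three stream handles but never assumes
they point at distinct files.  A sketch (hypothetical Java-ish names
mirroring the C above):

    // Hypothetical sketch: three handles, maybe one underlying file.
    interface InStream { /* byte stream, as in KS */ }

    class ScorePosting {
        private final InStream frqIn, normsIn, prxIn;

        ScorePosting(InStream frqIn, InStream normsIn, InStream prxIn) {
            this.frqIn   = frqIn;
            this.normsIn = normsIn;
            this.prxIn   = prxIn;
        }
        // ... next() reads from the three handles, as in the C above ...
    }

    // Multi-file layout: three distinct streams.
    //   new ScorePosting(dir.open("seg.frq"), dir.open("seg.nrm"),
    //                    dir.open("seg.prx"));
    // Unified layout: the same stream passed three times; the read
    // logic is identical, it just interleaves reads on one file.
    //   InStream unified = dir.open("seg.pst");
    //   new ScorePosting(unified, unified, unified);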
>
>  Seeking might get a little weird, I suppose.

Maybe not?  If the container is only aware of the single InStream, and
say it's "indexed" with a multi-skip index, then when you ask the
container to seek, it forwards the request to the multi-skip index,
which jumps as close as it can, and then asks the codec to skip docs
until it's at the requested docID.  Ie, the container can still be
given a single InStream, even though the codec "thinks" it's working
with 3.
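
Something like this, as a sketch (all names made up; assumes the skip
index is docID-aware and can reposition the codec's stream):

    // Hypothetical sketch: seek = skip-index jump + linear codec scan.
    interface SkipIndex { void seekNear(int targetDoc, PostingCodec c); }
    interface PostingCodec { boolean next(); int docNum(); }

    class PostingList {
        private final SkipIndex skips;    // multi-level skip data
        private final PostingCodec codec; // decodes records; no seek logic

        PostingList(SkipIndex skips, PostingCodec codec) {
            this.skips = skips; this.codec = codec;
        }

        boolean skipTo(int targetDoc) {
            // Jump the underlying stream(s) to the nearest skip point
            // at or before the target...
            skips.seekNear(targetDoc, codec);
            // ...then decode records until we reach the target docID.
            while (codec.docNum() < targetDoc) {
                if (!codec.next()) return false;
            }
            return true;
        }
    }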

> I think a variant of that read op could be created against various versions
> of the Lucene file format going back, making it possible to isolate and
> archive obsolete codecs and clean up the container classes.

I like this approach.  It would allow us to decouple the codec from
how many (1, 2, 3) and which files are actually storing the data.

> > When reading an index, the
> > Posting/PostingList should be more like TermBuffer than Term.
> >
>
>  Yes.
>
> Now's a good time to remark on a difference between KS and Lucene when
> reading the equivalent of TermPositions: In KS, all positions are read in
> one grab during ScorePost_Read_Record -- there's no nextPosition() method.
> That was a tradeoff I made primarily for simplicity's sake, since it meant
> that PhraseScorer could be implemented with integer arrays and pointer math.
> Reduced CPU overhead was another theoretical benefit, but I've never
> benchmarked it.
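
To make that concrete: with all of a doc's positions already in
memory, an exact-phrase check is just a merge over sorted int arrays.
A sketch, not KS's actual code:

    // Sketch: posA/posB are sorted positions for two adjacent phrase
    // terms within one doc; match if some p in posA has p + 1 in posB.
    final class PhraseCheck {
        static boolean phraseMatch(int[] posA, int lenA,
                                   int[] posB, int lenB) {
            int i = 0, j = 0;
            while (i < lenA && j < lenB) {
                int want = posA[i] + 1;   // where term B must appear
                if (posB[j] == want) return true;
                if (posB[j] < want) j++; else i++;
            }
            return false;
        }
    }

No per-position method dispatch -- just array reads and pointer math.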
>
> If you didn't want to do that but you still wanted to implement a
> PhraseScorer based around PostingList objects rather than TermPositions
> objects, you'd have a bit of a quandary: PostingList only advances in
> doc-sized increments, because it doesn't have a nextPosition() method.  So,
> nextPosition() would have to be implemented by ScorePosting:
>
>   // Iterate over docs and positions.
>   while (postingList.next()) {
>     ScorePosting posting = postingList.getPosting();
>     while (posting.nextPosition()) {
>       int position = posting.getPosition();
>       ...
>     }
>   }
>
> If we did that, it would require a change in the behavior for
> ScorePost_Next(), above.

OK, I guess this is just the follow-through by KS on the same
philosophy above (precluding differentiating term-only vs
terms-and-positions queries).  I would think there is a non-trivial
performance cost for term-only queries in KS?
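
For what it's worth, the deferred-positions change might look roughly
like this (a Java-ish sketch with made-up names; readC32()/readU8()
mirror the C calls above, not any real API):

    import java.io.IOException;

    // Hypothetical stream API mirroring KS's InStream.
    interface InStream {
        int readC32() throws IOException; // compressed 32-bit int
        int readU8()  throws IOException; // single byte
    }

    // next() no longer slurps all positions up front; nextPosition()
    // decodes one delta at a time.
    class LazyScorePosting {
        private final InStream frqIn, normsIn, prxIn;
        private final float[] normDecoder;
        private int docNum, freq, position, proxLeft;
        private float impact;

        LazyScorePosting(InStream frqIn, InStream normsIn,
                         InStream prxIn, float[] normDecoder) {
            this.frqIn = frqIn; this.normsIn = normsIn;
            this.prxIn = prxIn; this.normDecoder = normDecoder;
        }

        void next() throws IOException {
            // Drain any unconsumed deltas so prxIn stays in sync.
            while (proxLeft > 0) { prxIn.readC32(); proxLeft--; }
            int docCode = frqIn.readC32();
            docNum += docCode >>> 1;                 // apply doc delta
            freq = ((docCode & 1) != 0) ? 1 : frqIn.readC32();
            impact = normDecoder[normsIn.readU8()];  // boost/norm byte
            proxLeft = freq;                         // positions unread
            position = 0;                            // deltas restart
        }

        boolean nextPosition() throws IOException {
            if (proxLeft == 0) return false;
            proxLeft--;
            position += prxIn.readC32();             // one delta, lazily
            return true;
        }

        int getPosition() { return position; }
    }

The drain step in next() is the catch: a term-only query still has to
decode (or at least skip) the position deltas unless the file layout
or skip data lets it leap over them -- which is exactly the cost in
question.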

> > Thinking about a codec reading multiple files... I would also love
> > this: a codec that can write & read layered updates to the inverted
> > files.
>
> IMO, it should not be the codec object that performs actions like this --
> it should be a container class that the user knows nothing about.

Maybe this is something in between codec & container, because it'd
have to "understand" that the streams/codecs it's managing iterate
through docIDs: favoring the newer stream when two carry the same
docID, having older streams skip over their byte blob for a docID
that's been obsoleted, etc.

Still, I agree it's more like a container than a codec.  Clearly you'd
want to re-use this ability across multiple kinds of files
(non-sparse, fixed-size like norms; sparse, variable-size like the
int[] example below).
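
A sketch of that in-between layer (hypothetical names; assumes each
layer yields docIDs in increasing order and starts positioned on its
first doc):

    // Hypothetical sketch: merge two docID-ordered layers, newer wins.
    interface Layer {
        int doc();          // current docID, Integer.MAX_VALUE at end
        void advance();     // move to the next docID
        void skipRecord();  // hop over this doc's byte blob, undecoded
        byte[] record();    // decode this doc's record
    }

    final class LayeredReader {
        private final Layer newer, older;

        LayeredReader(Layer newer, Layer older) {
            this.newer = newer; this.older = older;
        }

        // Next live record in docID order, or null when exhausted.
        byte[] next() {
            if (newer.doc() == older.doc()
                    && newer.doc() != Integer.MAX_VALUE) {
                older.skipRecord();  // obsoleted by the newer layer
                older.advance();
            }
            Layer src = newer.doc() <= older.doc() ? newer : older;
            if (src.doc() == Integer.MAX_VALUE) return null;
            byte[] rec = src.record();
            src.advance();
            return rec;
        }
    }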

Actually, even the multi-level skipping container (used in the
sparse, variable-size int[] example below) needs to be aware of
docIDs.

Mike
