On Apr 27, 2008, at 3:28 AM, Michael McCandless wrote:
Actually, I was picturing that the container does the seeking itself
(using skip data), to get "close" to the right point, and then it uses
the codec to step through single docs at a time until it's at or
beyond the right one.
I believe it makes sense to have the skip stream be the sole
responsibility of the container.
In any case, the skip stream would be the *only* stream the
container knows about. All other streams would be internal to the
codec: frq, prx, etc.
Ie, no SkipDatum is created & passed to the codec, though likely the
container should notify the codec that it just got skipped, in case it
has internal state that cares.
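For concreteness, here's roughly how I read that proposal -- a
hypothetical Java sketch with made-up class names, not actual Lucene
or KS interfaces:

    import java.io.IOException;

    // A hypothetical sketch of that division of labor -- made-up names,
    // not actual Lucene or KS interfaces.  The container owns the skip
    // stream; the codec owns frq, prx, etc.
    interface SkipSource {
      // Consult the skip data and position us at the last skip point
      // at or before target.
      void seekToPointBefore(int target) throws IOException;
    }

    abstract class PostingCodec {
      // Decode one posting from the codec's own streams.
      abstract boolean next() throws IOException;
      // Current doc number.
      abstract int doc();
      // Container says "you just got skipped"; reset internal state.
      void notifySkipped() throws IOException {}
    }

    class PostingListContainer {
      private final SkipSource skipper;  // only stream the container sees
      private final PostingCodec codec;

      PostingListContainer(SkipSource skipper, PostingCodec codec) {
        this.skipper = skipper;
        this.codec = codec;
      }

      boolean skipTo(int target) throws IOException {
        // Get "close" using the skip data, then let the codec know.
        skipper.seekToPointBefore(target);
        codec.notifySkipped();
        // Step single docs until at or beyond the target.
        while (codec.doc() < target) {
          if (!codec.next()) {
            return false;
          }
        }
        return true;
      }
    }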
But you'll have to seek frq, prx, etc. as well -- right? Say you've
got a large PostingList currently located near its top and you want
to skip to halfway through. How do you do that without skipping the
underlying streams, which the codec knows about but the container
doesn't?
If you don't skip the internal state of the codec, including the
locations of its InStreams, you just have to keep calling next() on it
over and over -- an un-optimized skipTo() implementation. That can't
be what you want; presumably we're miscommunicating somehow.
The container is only aware of the single InStream, while the codec can
still think it's operating on 3 even if it's really 1 or 2.
I don't understand. If you have three streams, all of them are going
to have to get skipped, right?
The downside is that the codec object itself suddenly has to get a
lot bigger to hold all the InStreams.
But not many instances of the codec are created at once, right?
Yes.
That's not how I originally implemented Posting. It was a small
class, and PostingList originally had a Bulk_Read() method that read 1
kB worth of Posting objects into a ByteBuf, stacked end to end. But
we agree that the codec class will need to be larger.
And even so one could plug in their own (single stream) codec if
need be?
Sure.
The question is how we set up PostingList's iterator. In KS right
now, SegPList_Next() calls Posting's Read_Record() method, which takes
an InStream as an argument -- Posting doesn't maintain any streams
internally.
As soon as the codec starts reading from an indeterminate number of
streams, though, having the container pass them in for each read op
won't work any more. The codec will have to be responsible for all
data streams.
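To put the contrast in code, here's a loose Java rendering with
hypothetical signatures (the real KS code is C):

    import java.io.IOException;

    // A loose Java rendering of the contrast, with hypothetical
    // signatures -- the real KS code is C, and these aren't actual
    // Lucene interfaces either.

    interface InStream {               // stand-in for a raw input stream
      long readVLong() throws IOException;
    }

    // Today's KS shape: Posting holds no streams; the container hands
    // one in for every read op, the way SegPList_Next() drives
    // Read_Record() now.
    interface StatelessPosting {
      void readRecord(InStream in) throws IOException;
    }

    // The shape an indeterminate number of streams forces: the codec
    // opens and owns frq, prx, and whatever else its format needs, and
    // the container just asks for the next record.
    interface StreamOwningCodec {
      void open(String segment) throws IOException;
      boolean readRecord() throws IOException;  // reads its own streams
      void close() throws IOException;
    }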
The only penalty for having a TermPositions object read
positions in bulk like that is memory footprint (since each
TermPositions
object needs a growable array of 32-bit integers to store the bulk
positions).
This is a tiny memory cost right? A TermPositions instance gets
re-used each time you next() to the next term.
With large documents, the buffer size requirements can get pretty big
-- consider how often the term "the" might appear in a 1000-page
novel. Lucene and its relatives don't work very well with novel-sized
documents anyway, though, so for better or worse, I blew it off.
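For reference, the bulk approach boils down to something like this --
a rough Java transliteration with made-up names, since the real KS
code is C:

    import java.io.DataInput;
    import java.io.IOException;

    // A rough Java transliteration of the bulk approach, with made-up
    // names -- the real KS code is C.  One growable int array per
    // reader, refilled for each doc.
    class BulkPositionsReader {
      private int[] positions = new int[32];  // reused across docs
      private int count;

      // Decode every position for the current doc up front.  Positions
      // are stored as deltas, so accumulate while reading.  (A real prx
      // stream would use variable-length ints rather than readInt().)
      void refill(DataInput prx, int freq) throws IOException {
        if (positions.length < freq) {
          positions = new int[Math.max(freq, positions.length * 2)];
        }
        int position = 0;
        for (int i = 0; i < freq; i++) {
          position += prx.readInt();  // delta from previous position
          positions[i] = position;
        }
        count = freq;
      }

      int count() { return count; }
      int position(int i) { return positions[i]; }
    }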
I think you also pay an added up-front (small) decoding cost in cases
where the consumer will only look at a subset of the positions before
next()'ing. Eg a phrase search involving a rare term and a frequent
term.
Yes, you're right about that. The KS phrase scorer has less function
call overhead, though -- it's a more limited design, with no 'slop'
support, and it operates using pointer math on the arrays of positions
directly rather than having to make a lot of accessor calls.
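In Java terms, the inner loop for two terms looks something like the
sketch below -- a loose transliteration, since the real scorer does
this in C with pointers:

    // A loose Java transliteration of the KS-style inner loop for two
    // terms -- the real scorer is C and walks the arrays with pointer
    // arithmetic.  Exact match only, no slop: count the places where
    // term B appears exactly `offset` positions after term A within
    // the current doc.
    final class ExactPhraseMatcher {
      static int phraseFreq(int[] aPos, int aCount,
                            int[] bPos, int bCount, int offset) {
        int hits = 0;
        int i = 0;
        int j = 0;
        while (i < aCount && j < bCount) {
          int want = aPos[i] + offset;  // where term B must sit
          if (bPos[j] < want) {
            j++;
          } else if (bPos[j] > want) {
            i++;
          } else {                      // exact alignment: a match
            hits++;
            i++;
            j++;
          }
        }
        return hits;
      }
    }

With both position arrays decoded up front, there are no per-position
accessor calls in that loop -- it's all plain array reads.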
My guess is that it's a wash, algorithm-wise. It seems likely that the
file format would have a significant effect, but I would be surprised
to see phrase scorer performance in KS and Lucene diverge all that
much as a result of the algorithmic implementation.
That's why I asserted that the main motivation for bulk-reading
positions in KS was simplicity, rather than performance or something
else.
But the good news is that if this framework is pluggable, one can plug
in their own codec that doesn't do the up-front decoding of all
positions per term X doc.
Yes, that would work. You would need different Scorer implementations
to deal with codecs which read the same file format yet are
implemented differently... but that's a feature. :)
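One way to picture the seam -- a made-up interface, nothing settled:
both codecs would expose positions through something like this, and
each Scorer pairs with the implementation whose trade-offs suit it.

    import java.io.IOException;

    // A made-up interface just to picture the seam -- nothing settled.
    // A bulk codec backs it with an array decoded up front; a lazy
    // codec decodes each position on demand.  Scorers pair with
    // whichever implementation suits their access pattern.
    interface PositionSource {
      int freq();                             // positions in current doc
      int nextPosition() throws IOException;  // eager or lazy decode
    }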
We'd need both PostingBuffer and Posting subclasses.
Yes.
OK, I think we're almost at the point where we can hack up prototypes
for Lucene implementations of PostingBuffer and PostingList that read
the current file format. I think seeing those would clarify things.
Ironically, I'm not sure exactly how Posting fits into the equation at
read-time anymore, but I think that will work itself out as we get
back to indexing.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/