Re: Flexible indexing design

Marvin Humphrey Thu, 24 Apr 2008 15:54:56 -0700


On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote:

Seeking might get a little weird, I suppose.


Maybe not?: if the container is only aware of the single InStream, and
say it's "indexed" with a multi-skip index, then when you ask
container to seek, it forwards the request to multi-skip which jumps
as close as it can, and then it asks the codec to skip docs until it's
at the requested docID.  Ie, the container can still be given a single
InStream, even though the codec "thinks" it's working with 3.


So, if I follow you...

1) When the requested doc number is far enough away to trigger skipping, segmentPostingList.skipTo() would operate on the skip stream and generate an opaque SkipDatum object (or something like that), which it would pass to the codec object.

2) The codec object would use the contents of the SkipDatum object -- an array of file pointers large enough to accommodate all of the InStreams, plus payload-type information as applicable -- to update itself into a consistent state just shy of or at the target doc.

3) The codec object will iterate through docs until it's at or past the target.

Here's what's weird. Say the codec is only operating on 1 stream but it thinks it's operating on 3. Upon receiving the SkipDatum with 3 file pointers, it will seek the same InStream 3 times.

That will only work if the file pointers are all the same -- which would be strange because then the second and third file pointers won't actually be marking the real data.

Or perhaps the SkipDatum object will only contain one file pointer per "real" stream and the codec object will seek until it runs out of pointers in the stack? I.e., it has three instreams, it receives a SkipDatum with only one file pointer, it seeks the first stream, then bails out and returns, ignoring the others.
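
A minimal sketch of that last variant, where the codec seeks one stream per pointer it receives and ignores the rest. Every name here (FakeInStream, SkipDatum's fields, SketchCodec, applySkip) is hypothetical and exists only for illustration; none of it is Lucene or KinoSearch API, and the per-doc iteration that follows a skip is left out.

```java
// Hypothetical stand-in for an InStream: it only records where it was sent.
class FakeInStream {
    long pos = 0;
    int seekCount = 0;

    void seek(long target) {
        pos = target;
        seekCount++;
    }
}

// Hypothetical SkipDatum: the doc number the skip lands on, plus one file
// pointer per "real" stream (possibly fewer than the codec expects).
class SkipDatum {
    final int doc;
    final long[] filePointers;

    SkipDatum(int doc, long... filePointers) {
        this.doc = doc;
        this.filePointers = filePointers;
    }
}

// Hypothetical codec that believes it owns several streams.
class SketchCodec {
    final FakeInStream[] streams;
    int doc = -1;

    SketchCodec(FakeInStream... streams) {
        this.streams = streams;
    }

    // Seek one stream per supplied pointer, then stop; streams without a
    // pointer are left untouched. Returns how many streams were moved.
    int applySkip(SkipDatum datum) {
        int n = Math.min(streams.length, datum.filePointers.length);
        for (int i = 0; i < n; i++) {
            streams[i].seek(datum.filePointers[i]);
        }
        doc = datum.doc;
        return n;
    }
}
```

With three streams and a SkipDatum carrying a single pointer, applySkip() moves only the first stream, which is the "bails out and returns" behavior described above.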

I think a variant of that read op could be created against various versions of the Lucene file format going back, making it possible to isolate and
archive obsolete codecs and clean up the container classes.

I like this approach.  It would allow us to decouple the codec from
how many (1, 2, 3) and which files are actually storing the data.

The downside is that the codec object itself suddenly has to get a lot bigger to hold all the instreams.

When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.

Yes.

Now's a good time to remark on a difference between KS and Lucene when reading the equivalent of TermPositions: In KS, all positions are read in one grab during ScorePost_Read_Record -- there's no nextPosition() method. That was a tradeoff I made primarily for simplicity's sake, since it meant that PhraseScorer could be implemented with integer arrays and pointer math.
Reduced CPU overhead was another theoretical benefit, but I've never
benchmarked it.
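
To illustrate why having every position in hand makes phrase matching simple, here is a rough sketch that counts phrase matches using only sorted integer arrays and index arithmetic. It is not the KinoSearch PhraseScorer; the class and method names are hypothetical, and it assumes each term's positions for the current doc arrive sorted in ascending order.

```java
class PhraseMatchSketch {
    // positions[i] holds the sorted positions of the i-th phrase term in one
    // doc. A phrase match starts at p when term i occurs at p + i for all i.
    static int countPhraseMatches(int[][] positions) {
        if (positions.length == 0) {
            return 0;
        }
        // Start with every position of the first term as a candidate.
        int[] candidates = positions[0].clone();
        int numCandidates = candidates.length;

        // Winnow the candidates against each later term in turn.
        for (int i = 1; i < positions.length && numCandidates > 0; i++) {
            int[] other = positions[i];
            int kept = 0;
            int j = 0;
            for (int c = 0; c < numCandidates; c++) {
                int wanted = candidates[c] + i;
                // Both arrays are sorted, so j only ever moves forward.
                while (j < other.length && other[j] < wanted) {
                    j++;
                }
                if (j < other.length && other[j] == wanted) {
                    candidates[kept++] = candidates[c];
                }
            }
            numCandidates = kept;
        }
        return numCandidates;
    }
}
```

For example, with term positions {0, 5, 9}, {1, 6, 20} and {2, 7, 11}, the phrase starts at 0 and at 5, so the method returns 2.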

If you didn't want to do that but you still wanted to implement a
PhraseScorer based around PostingList objects rather than TermPositions
objects, you have a bit of a quandary: PostingList only advances in
doc-sized increments, because it doesn't have a nextPosition() method. So,
nextPosition() would have to be implemented by ScorePosting:

 // Iterate over docs and positions.
 while (postingList.next()) {
   ScorePosting posting = postingList.getPosting();
   while (posting.nextPosition()) {
     int position = posting.getPosition();
     ...
   }
 }

If we did that, it would require a change in the behavior for
ScorePost_Next(), above.

OK, I guess this is just the follow-through by KS on the same
philosophy above (precluding differentiating term-only vs
terms-and-positions queries).

Not exactly. While the devel branch of KS uses PostingList and the unified postings format, the maint branch uses a fairly close variant on the Lucene file format and a modified version of the TermDocs family. The equivalent of TermPositions in KS maint *also* reads positions in bulk, just like KS devel does. The only penalty for having a TermPositions object read positions in bulk like that is memory footprint (since each TermPositions object needs a growable array of 32-bit integers to store the bulk positions).

Reading positions in bulk or not is a decision that can be made independently of either the decision to go with a unified postings format or the decision to implement PostingList. I provided the information on KinoSearch's behavior because I thought the code samples for PostingList I'd supplied begged the question, "How do you iterate through positions if PostingList doesn't know about them?"
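
The memory cost can be pictured with a small sketch of a reusable, growable int buffer that is refilled once per doc. The BulkPositions class is hypothetical, and the input is assumed (for illustration only) to be an already-decoded array of position deltas; the real KS read routine works on the raw postings stream.

```java
import java.util.Arrays;

// Hypothetical per-object buffer that holds all positions for one doc.
class BulkPositions {
    private int[] positions = new int[4];
    private int count = 0;

    // Decode every position for one doc in a single pass. The buffer is
    // reused between docs and only ever grows, which is the footprint cost.
    void readRecord(int[] deltas) {
        if (deltas.length > positions.length) {
            int newCap = Math.max(deltas.length, positions.length * 2);
            positions = Arrays.copyOf(positions, newCap);
        }
        int pos = 0;
        for (int i = 0; i < deltas.length; i++) {
            pos += deltas[i];      // deltas are gaps between positions
            positions[i] = pos;
        }
        count = deltas.length;
    }

    int count() {
        return count;
    }

    int get(int i) {
        if (i < 0 || i >= count) {
            throw new IndexOutOfBoundsException("position index " + i);
        }
        return positions[i];
    }

    int capacity() {
        return positions.length;
    }
}
```

After a doc with many positions has been read, the buffer keeps its enlarged capacity even when later docs need only a slot or two.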

I would think there is a non-trivial
performance cost for term-only queries in KS?

I'm not sure what the size of the effect is. KS has its old indexing benchmarking app, but it doesn't have a search benchmarking app and I don't plan to write one. I figure it's more profitable to finish the C porting of KS, write a Java binding and try to hook into the Lucene contrib benchmarking code than it is to either port the whole thing or write one from scratch.

The unified postings format seems likely to suffer some penalty on simple term queries and also likely to make up some of that ground on phrase queries. We originally thought motivated users could compensate by spec'ing a simpler format for some fields, but now that I've actually implemented the unified format, I just don't see that approach as practical.

So now, I've swung back to favoring the divide-by-data-type approach, but I want to keep the codec/container roles.

Read-time isn't a problem. We can outfit the codec object with multiple InStreams. PostingList can continue to be a container which advances doc-at-a-time (regardless of whether we read positions in bulk a la KinoSearch or defer that till later a la Lucene).

However, if we add all that stuff, the codec object gets a lot bigger. That means Posting, as I've envisioned and implemented it, is no longer suitable. We'd need both PostingBuffer and Posting subclasses.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

