On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote:
Seeking might get a little weird, I suppose.

Maybe not?: if the container is only aware of the single InStream, and
say it's "indexed" with a multi-skip index, then when you ask
container to seek, it forwards the request to multi-skip which jumps
as close as it can, and then it asks the codec to skip docs until it's
at the requested docID.  Ie, the container can still be given a single
InStream, even though the codec "thinks" it's working with 3.

So, if I follow you...

1) When the requested doc number is far enough away to trigger skipping, segmentPostingList.skipTo() would operate on the skip stream and generate an opaque SkipDatum object (or something like that), which it would pass to the codec object.

2) The codec object would use the contents of the SkipDatum object -- an array of file pointers large enough to accommodate all of the InStreams, plus payload-type information as applicable -- to update itself into a consistent state just shy of or at the target doc.

3) The codec object will iterate through docs until it's at or past the target.

Here's what's weird. Say the codec is only operating on 1 stream but it thinks it's operating on 3. Upon receiving the SkipDatum with 3 file pointers, it will seek the same InStream 3 times.

That will only work if the file pointers are all the same -- which would be strange because then the second and third file pointers won't actually be marking the real data.

Or perhaps the SkipDatum object will only contain one file pointer per "real" stream, and the codec object will seek until it runs out of pointers in the stack? I.e., it has three InStreams, it receives a SkipDatum with only one file pointer, it seeks the first stream, then bails out and returns, ignoring the others.
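For concreteness, here's a minimal sketch of that last variant -- one file pointer per "real" stream, with the codec bailing out once the pointers run out. All the class names (SkipDatum, InStream, SkippingCodec) are placeholders for the discussion above, not actual Lucene or KS APIs:

```java
import java.util.List;

// InStream stands in for a seekable index input; only the file pointer
// matters for this sketch.
class InStream {
    private long pos;
    void seek(long filePointer) { pos = filePointer; }
    long getFilePointer() { return pos; }
}

// One file pointer per *real* underlying stream, plus the docID the
// skip entry lands on.
class SkipDatum {
    final long[] filePointers;
    final int doc;
    SkipDatum(long[] filePointers, int doc) {
        this.filePointers = filePointers;
        this.doc = doc;
    }
}

class SkippingCodec {
    private final List<InStream> streams;  // the codec may "think" it has 3
    private int doc = -1;
    SkippingCodec(List<InStream> streams) { this.streams = streams; }

    // Seek one stream per pointer, then bail out when the pointers run
    // out, leaving any remaining streams untouched.
    void seek(SkipDatum datum) {
        int n = Math.min(datum.filePointers.length, streams.size());
        for (int i = 0; i < n; i++) {
            streams.get(i).seek(datum.filePointers[i]);
        }
        doc = datum.doc;
    }
    int getDoc() { return doc; }
}
```

After the seek, the container would still have to ask the codec to iterate docs until it reaches the requested docID, as in step 3 above.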

I think a variant of that read op could be created against various versions of the Lucene file format going back, making it possible to isolate and
archive obsolete codecs and clean up the container classes.

I like this approach.  It would allow us to decouple the codec from
how many (1, 2, 3) and which files are actually storing the data.

The downside is that the codec object itself suddenly has to get a lot bigger to hold all the InStreams.

When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.


Yes.

Now's a good time to remark on a difference between KS and Lucene when reading the equivalent of TermPositions: In KS, all positions are read in one grab during ScorePost_Read_Record -- there's no nextPosition() method. That was a tradeoff I made primarily for simplicity's sake, since it meant that PhraseScorer could be implemented with integer arrays and pointer math.
Reduced CPU overhead was another theoretical benefit, but I've never
benchmarked it.
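A sketch of that bulk read, with illustrative names rather than actual KinoSearch code: the on-disk delta-coded positions get decoded into a plain int array during the read, so PhraseScorer can work with integer arrays and pointer math instead of a nextPosition() iterator.

```java
// Hypothetical posting that reads all of a doc's positions in one grab,
// KS-style. Field and method names are my assumptions for illustration.
class BulkPosting {
    int doc;
    int count;
    int[] positions = new int[0];  // growable scratch buffer, reused per doc

    // deltas stands in for the delta-encoded positions decoded from the
    // postings stream for a single doc.
    void readRecord(int doc, int[] deltas) {
        this.doc = doc;
        this.count = deltas.length;
        if (positions.length < count) {
            positions = new int[count];
        }
        int pos = 0;
        for (int i = 0; i < count; i++) {
            pos += deltas[i];      // undo the delta coding
            positions[i] = pos;
        }
    }
}
```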

If you didn't want to do that but you still wanted to implement a
PhraseScorer based around PostingList objects rather than TermPositions
objects, you have a bit of a quandary: PostingList only advances in
doc-sized increments, because it doesn't have a nextPosition() method. So,
nextPosition() would have to be implemented by ScorePosting:

 // Iterate over docs and positions.
 while (postingList.next()) {
   ScorePosting posting = postingList.getPosting();
   while (posting.nextPosition()) {
     int position = posting.getPosition();
     ...
   }
 }

If we did that, it would require a change in the behavior for
ScorePost_Next(), above.
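Here's a rough sketch of what that ScorePosting might look like internally: nextPosition() becomes a cursor over the bulk-read positions array, and the reset-on-advance is the ScorePost_Next() behavior change in question. The field names and reset contract are my assumptions, not actual KS code.

```java
// Hypothetical ScorePosting for the loop above: nextPosition() walks a
// positions array that was filled in one grab when the PostingList
// advanced to this doc.
class ScorePosting {
    private int[] positions = new int[0];
    private int count;
    private int cursor = -1;

    // Called when the PostingList advances to a new doc; resets the
    // position cursor for the fresh record.
    void reset(int[] positions, int count) {
        this.positions = positions;
        this.count = count;
        this.cursor = -1;
    }

    boolean nextPosition() {
        return ++cursor < count;
    }

    int getPosition() {
        return positions[cursor];
    }
}
```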

OK, I guess this is just the follow-through by KS on the same
philosophy above (precluding differentiating term-only vs
terms-and-positions queries).

Not exactly. While the devel branch of KS uses PostingList and the unified postings format, the maint branch uses a fairly close variant on the Lucene file format and a modified version of the TermDocs family. The equivalent of TermPositions in KS maint *also* reads positions in bulk, just like KS devel does. The only penalty for having a TermPositions object read positions in bulk like that is memory footprint (since each TermPositions object needs a growable array of 32-bit integers to store the bulk positions).

Reading positions in bulk or not is a decision that can be made independently of either the decision to go with a unified postings format or the decision to implement PostingList. I provided the information on KinoSearch's behavior because I thought the code samples for PostingList I'd supplied raised the question, "How do you iterate through positions if PostingList doesn't know about them?"

I would think there is a non-trivial
performance cost for term-only queries in KS?

I'm not sure what the size of the effect is. KS has its old indexing benchmarking app, but it doesn't have a search benchmarking app and I don't plan to write one. I figure it's more profitable to finish the C porting of KS, write a Java binding and try to hook into the Lucene contrib benchmarking code than it is to either port the whole thing or write one from scratch.

The unified postings format seems likely to suffer some penalty on simple term queries and also likely to make up some of that ground on phrase queries. We originally thought motivated users could compensate by spec'ing a simpler format for some fields, but now that I've actually implemented the unified format, I just don't see that approach as practical.

So now, I've swung back to favoring the divide-by-data-type approach, but I want to keep the codec/container roles.

Read-time isn't a problem. We can outfit the codec object with multiple InStreams. PostingList can continue to be a container which advances doc-at-a-time (regardless of whether we read positions in bulk a la KinoSearch or defer that till later a la Lucene).

However, if we add all that stuff, the codec object gets a lot bigger. That means Posting, as I've envisioned and implemented it, is no longer suitable. We'd need both PostingBuffer and Posting subclasses.
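To make that split concrete, here's a rough sketch by analogy with TermBuffer vs. Term. Everything here -- fields, method names, the snapshot contract -- is my assumption about how the split might look, not settled design:

```java
// Posting: a small immutable value, analogous to Term.
class Posting {
    final int doc;
    final int freq;
    Posting(int doc, int freq) { this.doc = doc; this.freq = freq; }
}

// PostingBuffer: the heavy, reusable read-time object, analogous to
// TermBuffer. In a real codec it would also own the InStreams plus
// scratch arrays for positions and payloads -- the bulk that makes the
// lightweight Posting unsuitable as the read-time workhorse.
class PostingBuffer {
    private int doc = -1;
    private int freq;

    // Stand-in for decoding the next record from the InStreams.
    void read(int doc, int freq) {
        this.doc = doc;
        this.freq = freq;
    }

    // Snapshot the current record as a cheap value object that survives
    // the buffer being advanced.
    Posting toPosting() { return new Posting(doc, freq); }
}
```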

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

