On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote:
Seeking might get a little weird, I suppose.
Maybe not?: if the container is only aware of the single InStream, and
say it's "indexed" with a multi-skip index, then when you ask
container to seek, it forwards the request to multi-skip which jumps
as close as it can, and then it asks the codec to skip docs until it's
at the requested docID. Ie, the container can still be given a single
InStream, even though the codec "thinks" it's working with 3.
So, if I follow you...
1) When the requested doc number is far enough away to trigger
skipping, segmentPostingList.skipTo() would operate on the skip stream
and generate an opaque SkipDatum object (or something like that),
which it would pass to the codec object.
2) The codec object would use the contents of the SkipDatum object --
an array of file pointers large enough to accommodate all of the
InStreams, plus payload-type information as applicable -- to update
itself into a consistent state just shy of or at the target doc.
3) The codec object will iterate through docs until it's at or past
the target.
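The three steps above can be sketched roughly as follows, in Java. This is only an illustration under assumptions: DocStream, SketchCodec, SegmentPostingListSketch, and the docBefore field are hypothetical names invented here, the "stream" is an in-memory array of doc numbers, and the skip stream is a plain list sorted by doc number rather than a real multi-level skip index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an InStream: a seekable cursor over doc numbers.
final class DocStream {
    private final int[] docs;
    private int pos = 0;

    DocStream(int[] docs) { this.docs = docs; }

    void seek(long filePointer) { pos = (int) filePointer; }
    boolean hasNext() { return pos < docs.length; }
    int readDoc() { return docs[pos++]; }
}

// Opaque skip record: the last doc before the skip point, plus one file
// pointer per stream (the field layout is an assumption).
final class SkipDatum {
    final int docBefore;
    final long[] filePointers;

    SkipDatum(int docBefore, long[] filePointers) {
        this.docBefore = docBefore;
        this.filePointers = filePointers;
    }
}

// Hypothetical codec object that owns the streams.
final class SketchCodec {
    private final DocStream[] streams;
    private int doc = -1;

    SketchCodec(DocStream... streams) { this.streams = streams; }

    int doc() { return doc; }

    // Step 2: seek every stream; assumes one file pointer per stream.
    void applySkip(SkipDatum datum) {
        for (int i = 0; i < streams.length; i++) {
            streams[i].seek(datum.filePointers[i]);
        }
        doc = datum.docBefore;
    }

    // Advance one doc; only the first stream carries doc numbers here.
    boolean next() {
        if (!streams[0].hasNext()) return false;
        doc = streams[0].readDoc();
        return true;
    }
}

// Hypothetical container: it knows about skip data, not about stream layout.
final class SegmentPostingListSketch {
    private final SketchCodec codec;
    private final List<SkipDatum> skipStream; // sorted by docBefore

    SegmentPostingListSketch(SketchCodec codec, List<SkipDatum> skipStream) {
        this.codec = codec;
        this.skipStream = skipStream;
    }

    boolean skipTo(int target) {
        // Step 1: find the furthest skip entry that is ahead of the
        // current doc but still short of the target.
        SkipDatum best = null;
        for (SkipDatum d : skipStream) {
            if (d.docBefore < target && d.docBefore > codec.doc()) best = d;
        }
        // Step 2: hand the opaque datum to the codec.
        if (best != null) codec.applySkip(best);
        // Step 3: iterate until at or past the target.
        while (codec.doc() < target) {
            if (!codec.next()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        DocStream stream = new DocStream(new int[] {1, 4, 7, 10, 15, 22, 30, 41});
        SketchCodec codec = new SketchCodec(stream);
        List<SkipDatum> skips = new ArrayList<>();
        skips.add(new SkipDatum(10, new long[] {4}));
        skips.add(new SkipDatum(30, new long[] {7}));
        SegmentPostingListSketch list = new SegmentPostingListSketch(codec, skips);
        System.out.println(list.skipTo(20) + " " + codec.doc());
    }
}
```

Note that the container never inspects the file pointers; only the codec interprets them, which is what keeps the datum opaque.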
Here's what's weird. Say the codec is only operating on 1 stream but
it thinks it's operating on 3. Upon receiving the SkipDatum with 3
file pointers, it will seek the same InStream 3 times.
That will only work if the file pointers are all the same -- which
would be strange because then the second and third file pointers won't
actually be marking the real data.
Or perhaps the SkipDatum object will only contain one file pointer per
"real" stream and the codec object will seek until it runs out of
pointers in the stack? I.e., it has three InStreams, it receives a
SkipDatum with only one file pointer, it seeks the first stream, then
bails out and returns, ignoring the others.
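That "seek until it runs out of pointers" variant might look something like the sketch below. SeekableStub and PointerLimitedCodec are hypothetical names, and returning the count of streams actually seeked is my own addition for illustration, not something proposed in the thread.

```java
// Hypothetical stand-in for an InStream that just records its position.
final class SeekableStub {
    private long pointer = 0;

    void seek(long filePointer) { pointer = filePointer; }
    long tell() { return pointer; }
}

// Hypothetical codec that holds three streams but tolerates a skip
// record carrying fewer file pointers than it has streams.
final class PointerLimitedCodec {
    private final SeekableStub[] streams;

    PointerLimitedCodec(SeekableStub... streams) { this.streams = streams; }

    // Seek streams in order until the pointers run out, then bail.
    // Returns how many streams were actually seeked.
    int applySkip(long[] filePointers) {
        int n = Math.min(streams.length, filePointers.length);
        for (int i = 0; i < n; i++) {
            streams[i].seek(filePointers[i]);
        }
        return n;
    }

    public static void main(String[] args) {
        SeekableStub a = new SeekableStub();
        SeekableStub b = new SeekableStub();
        SeekableStub c = new SeekableStub();
        PointerLimitedCodec codec = new PointerLimitedCodec(a, b, c);
        int seeked = codec.applySkip(new long[] {128});
        System.out.println(seeked + " " + a.tell() + " " + b.tell() + " " + c.tell());
    }
}
```

The second and third streams are left wherever they were, which is only safe if the codec never reads from them in the single-stream layout.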
I think a variant of that read op could be created against various
versions of the Lucene file format going back, making it possible to
isolate and archive obsolete codecs and clean up the container classes.
I like this approach. It would allow us to decouple the codec from
how many (1, 2, 3) and which files are actually storing the data.
The downside is that the codec object itself suddenly has to get a lot
bigger to hold all the InStreams.
When reading an index, the
Posting/PostingList should be more like TermBuffer than Term.
Yes.
Now's a good time to remark on a difference between KS and Lucene when
reading the equivalent of TermPositions: In KS, all positions are read
in one grab during ScorePost_Read_Record -- there's no nextPosition()
method. That was a tradeoff I made primarily for simplicity's sake,
since it meant that PhraseScorer could be implemented with integer
arrays and pointer math. Reduced CPU overhead was another theoretical
benefit, but I've never benchmarked it.
If you didn't want to do that but you still wanted to implement a
PhraseScorer based around PostingList objects rather than TermPositions
objects, you have a bit of a quandary: PostingList only advances in
doc-sized increments, because it doesn't have a nextPosition() method.
So, nextPosition() would have to be implemented by ScorePosting:
  // Iterate over docs and positions.
  while (postingList.next()) {
    ScorePosting posting = postingList.getPosting();
    while (posting.nextPosition()) {
      int position = posting.getPosition();
      ...
    }
  }
If we did that, it would require a change in the behavior for
ScorePost_Next(), above.
OK, I guess this is just the follow-through by KS on the same
philosophy above (precluding differentiating term-only vs
terms-and-positions queries).
Not exactly. While the devel branch of KS uses PostingList and the
unified postings format, the maint branch uses a fairly close variant
on the Lucene file format and a modified version of the TermDocs
family. The equivalent of TermPositions in KS maint *also* reads
positions in bulk, just like KS devel does. The only penalty for
having a TermPositions object read positions in bulk like that is
memory footprint (since each TermPositions object needs a growable
array of 32-bit integers to store the bulk positions).
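To make the memory tradeoff concrete, here is a rough Java sketch of reading positions in bulk into a growable int array. BulkPositionsSketch and its methods are hypothetical, the delta-encoded input array stands in for whatever would actually be decoded from the stream, and the doubling growth policy is an assumption, not a description of what KS does.

```java
import java.util.Arrays;

// Hypothetical bulk position reader: all positions for one doc are
// decoded in one pass into a reusable, growable array of 32-bit ints.
final class BulkPositionsSketch {
    private int[] positions = new int[4]; // grows on demand, never shrinks
    private int count = 0;

    // Decode one doc's worth of delta-encoded positions in a single grab.
    void readRecord(int[] deltas) {
        if (deltas.length > positions.length) {
            int newCap = Math.max(deltas.length, positions.length * 2);
            positions = Arrays.copyOf(positions, newCap);
        }
        count = deltas.length;
        int position = 0;
        for (int i = 0; i < count; i++) {
            position += deltas[i];
            positions[i] = position;
        }
    }

    int count() { return count; }
    int position(int i) { return positions[i]; }
    int capacity() { return positions.length; }

    public static void main(String[] args) {
        BulkPositionsSketch bulk = new BulkPositionsSketch();
        bulk.readRecord(new int[] {3, 2, 10});
        for (int i = 0; i < bulk.count(); i++) {
            System.out.println(bulk.position(i));
        }
    }
}
```

The array only ever grows to the largest record seen, so the footprint per object is bounded by the doc with the most positions for that term.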
Reading positions in bulk or not is a decision that can be made
independently of either the decision to go with a unified postings
format or the decision to implement PostingList. I provided the
information on KinoSearch's behavior because I thought the code
samples for PostingList I'd supplied begged the question, "How do you
iterate through positions if PostingList doesn't know about them?"
I would think there is a non-trivial
performance cost for term-only queries in KS?
I'm not sure what the size of the effect is. KS has its old indexing
benchmarking app, but it doesn't have a search benchmarking app and I
don't plan to write one. I figure it's more profitable to finish the
C porting of KS, write a Java binding and try to hook into the Lucene
contrib benchmarking code than it is to either port the whole thing or
write one from scratch.
The unified postings format seems likely to suffer some penalty on
simple term queries and also likely to make up some of that ground on
phrase queries. We originally thought motivated users could
compensate by spec'ing a simpler format for some fields, but now that
I've actually implemented the unified format, I just don't see that
approach as practical.
So now, I've swung back to favoring the divide-by-data-type approach,
but I want to keep the codec/container roles.
Read-time isn't a problem. We can outfit the codec object with
multiple InStreams. PostingList can continue to be a container which
advances doc-at-a-time (regardless of whether we read positions in
bulk a la KinoSearch or defer that till later a la Lucene).
However, if we add all that stuff, the codec object gets a lot
bigger. That means Posting, as I've envisioned and implemented it, is
no longer suitable. We'd need both PostingBuffer and Posting
subclasses.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/