Re: Could positions/payloads in SegmentMerger be copied directly?

Michael McCandless Wed, 24 Sep 2008 07:19:42 -0700


Paul Elschot wrote:

On Tuesday 23 September 2008 20:26:18, Michael McCandless wrote:

Paul Elschot wrote:

So, adding a document offset from the documents/frequencies
into the positions/payloads for each document would allow:
-  bulk copying of the position/payloads during merging, and
-  a more efficient implementation of TermPositions.skipTo()
 in that decoding the positions from the last available skip
 document to the target of skipTo() could be avoided.
Is that correct?


Yes, though this would also add cost of computing/writing/reading
that new offset, and would increase the index size.

That would indeed be invasive.


Yes.  I think our time would likely be better spent working on using
PForDelta for freq/prox.


To change the prox data to PForDelta, it would be nice to have
bulk copies on prox working first. That would make it easy to
change the total size of the prox data.

But it appears to be easier to start with the doc/freq data, add
more prox pointers there, and then change the prox data.

PForDelta is fundamentally different from the existing index data
because an encoded number does not necessarily start on a byte
boundary. I don't know yet how to deal with that in the index
data structures.


PForDelta encodes multiples of 32 ints at a time; so, the pointers
stored in the term dict, and in skip data, would presumably have to be
block number (or byte position in the file) plus offset within the
block.
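
As a minimal sketch of the "block number plus offset within the block" pointer described above, the two parts could be packed into a single long. The class name, the 128-int block size (one possible multiple of 32), and the 7-bit offset field are all assumptions made for illustration; none of this is existing Lucene code.

```java
// Hypothetical sketch, not Lucene code: a postings pointer made of
// a block number plus an offset within that block.
public final class BlockPointer {
    // Assumed block size: 128 ints (a multiple of 32), so the offset fits in 7 bits.
    static final int BLOCK_SIZE = 128;
    static final int OFFSET_BITS = 7;
    static final long OFFSET_MASK = BLOCK_SIZE - 1;

    // Pack block number and offset-within-block into one long.
    static long encode(long blockNumber, int offsetInBlock) {
        if (offsetInBlock < 0 || offsetInBlock >= BLOCK_SIZE) {
            throw new IllegalArgumentException("offset out of range: " + offsetInBlock);
        }
        return (blockNumber << OFFSET_BITS) | offsetInBlock;
    }

    // Which block has to be loaded and decoded.
    static long blockNumber(long pointer) {
        return pointer >>> OFFSET_BITS;
    }

    // Where inside the decoded block the wanted data starts.
    static int offsetInBlock(long pointer) {
        return (int) (pointer & OFFSET_MASK);
    }
}
```

A term dict entry or skip entry would then store one such value where a plain file pointer is stored today.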

And then an entire block must be fully decoded when loaded (I don't
think it's easy to partially decode with PForDelta, unless the block
luckily had no exceptions?), and then you start from the
offset-within-block you need.
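
The "decode the whole block, then start from the offset" pattern can be sketched as follows. Note that this is plain fixed-width bit packing with no exceptions, which is a deliberate simplification and not full PForDelta; the class name, method names, and the 32-int block size are assumptions for illustration only.

```java
import java.util.Arrays;

// Hypothetical sketch: fixed-width bit packing without exceptions,
// standing in for a PForDelta block. Not Lucene code.
public final class PackedBlock {
    static final int BLOCK_SIZE = 32; // assumed number of ints per block

    static int mask(int bits) {
        return bits == 32 ? -1 : (1 << bits) - 1;
    }

    // Pack BLOCK_SIZE values into words, using 'bits' bits per value (1..32).
    static int[] pack(int[] values, int bits) {
        int[] words = new int[(BLOCK_SIZE * bits + 31) / 32];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int v = values[i] & mask(bits);
            int bitPos = i * bits;
            int word = bitPos >>> 5;
            int shift = bitPos & 31;
            words[word] |= v << shift;
            if (shift + bits > 32) {
                // The value straddles a word boundary: spill the high bits.
                words[word + 1] |= v >>> (32 - shift);
            }
        }
        return words;
    }

    // Decode the entire block; values are not individually addressable by byte.
    static int[] unpack(int[] words, int bits) {
        int[] out = new int[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 5;
            int shift = bitPos & 31;
            int v = words[word] >>> shift;
            if (shift + bits > 32) {
                v |= words[word + 1] << (32 - shift);
            }
            out[i] = v & mask(bits);
        }
        return out;
    }

    // Fully decode the block, then hand back the values from offsetInBlock onward.
    static int[] readFrom(int[] words, int bits, int offsetInBlock) {
        int[] decoded = unpack(words, bits);
        return Arrays.copyOfRange(decoded, offsetInBlock, BLOCK_SIZE);
    }
}
```

Even when only the last few values are wanted, readFrom still pays for decoding all 32.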

I think a single block would hold more than one term's postings data
in general.  I.e., these blocks are like "pages" in virtual memory.

Also I wonder how PForDelta would impact performance of queries that
rely heavily on skipping (AND queries), because the entire block must
be decoded to read a few of its ints.
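
A rough way to put numbers on that concern, as a sketch only: count how many ints get decoded when a query skips to a handful of postings, assuming every block that is touched must be decoded in full. The class name, the cost model, and the block size passed in are assumptions, not measurements.

```java
// Hypothetical cost model, not Lucene code: ints decoded when every
// touched block has to be decoded in full.
public final class SkipDecodeCost {
    // sortedTargets: ascending indexes (within one postings stream) of the
    // ints a skipping query actually needs to read.
    static long intsDecoded(long[] sortedTargets, int blockSize) {
        long decoded = 0;
        long lastBlock = -1;
        for (long target : sortedTargets) {
            long block = target / blockSize;
            if (block != lastBlock) {
                // First touch of this block: pay for the whole block.
                decoded += blockSize;
                lastBlock = block;
            }
        }
        return decoded;
    }
}
```

Under this model, reading 4 ints that fall into 3 different 128-int blocks decodes 384 ints.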

However, with PForDelta I don't think we'd be able to do byte block
copying when merging, unless we were willing to keep the "seams" of
past merges present in the index files (the invasive change I was
referring to) and no deletions had been applied.

Mike

