Marvin Humphrey wrote:
On Jul 17, 2008, at 1:57 PM, eks dev wrote:
Is there any solution to have pure postings lists without
interleaved tf? This eats a lot of CPU for VInt decoding on
dense terms (and also doubles IO) in our case.
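To make the cost concrete, here is a minimal sketch of interleaved doc/freq postings encoded with VInts. It follows the general shape of Lucene's .frq encoding (doc delta shifted left by one, low bit set when the frequency is 1), but the class and method names (VIntPostings, encode, decodeDocs) are made up for illustration and are not Lucene APIs. Note that a reader that only wants docIDs still has to decode every frequency VInt in order to skip over it.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Lucene.
final class VIntPostings {

    // VInt: 7 data bits per byte, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Minimal cursor over a byte array.
    static final class Reader {
        final byte[] buf;
        int pos;

        Reader(byte[] buf) { this.buf = buf; }

        int readVInt() {
            byte b = buf[pos++];
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = buf[pos++];
                value |= (b & 0x7F) << shift;
            }
            return value;
        }
    }

    // Interleave doc deltas and term frequencies in one stream.
    static byte[] encode(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - last;
            last = docs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);   // low bit 1: freq is 1, no extra VInt
            } else {
                writeVInt(out, delta << 1);         // low bit 0: freq follows
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Recover only the docIDs; frequencies are decoded and thrown away.
    static int[] decodeDocs(byte[] data, int count) {
        Reader in = new Reader(data);
        int[] docs = new int[count];
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = in.readVInt();
            doc += code >>> 1;
            if ((code & 1) == 0) {
                in.readVInt();                      // wasted work when tf is not needed
            }
            docs[i] = doc;
        }
        return docs;
    }
}
```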
To decompress integers really quickly, we shouldn't even be using
VInts. We should be using PForDelta as described in
<http://www2008.org/papers/pdf/p387-zhangA.pdf>. Encoding postings
data with PForDelta is one of those rare opportunities to speed up
searching globally.
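For readers who have not seen the idea, the following is a rough sketch of a patched, fixed-width block in the spirit of PForDelta. It is a simplification and not the layout from the paper: every delta in a block is packed into the same number of bits, and values that do not fit are stored separately as exceptions and patched in after the tight unpacking loop. The class name PatchedBlock, its fields, and the choice to pass the bit width in by hand are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; a simplified patched block, not the paper's exact format.
final class PatchedBlock {
    final int bits;        // width of every packed slot, expected in 1..31
    final int size;        // number of values in the block
    final long[] packed;   // slots stored back to back
    final int[] excIndex;  // positions whose value did not fit in `bits`
    final int[] excValue;  // full values for those positions

    PatchedBlock(int[] deltas, int bits) {
        this.bits = bits;
        this.size = deltas.length;
        this.packed = new long[(deltas.length * bits + 63) / 64];
        int mask = (1 << bits) - 1;
        List<Integer> exceptions = new ArrayList<>();
        for (int i = 0; i < deltas.length; i++) {
            int v = deltas[i];
            if ((v & ~mask) != 0) {   // too wide: remember it, pack a zero placeholder
                exceptions.add(i);
                v = 0;
            }
            put(i, v);
        }
        excIndex = new int[exceptions.size()];
        excValue = new int[exceptions.size()];
        for (int k = 0; k < excIndex.length; k++) {
            excIndex[k] = exceptions.get(k);
            excValue[k] = deltas[excIndex[k]];
        }
    }

    private void put(int i, int v) {
        long bitPos = (long) i * bits;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        packed[word] |= ((long) v) << shift;
        if (shift + bits > 64) {      // slot straddles two words
            packed[word + 1] |= ((long) v) >>> (64 - shift);
        }
    }

    // Decode the whole block at once: branch-light unpack loop, then patch exceptions.
    void decode(int[] out) {
        long mask = (1L << bits) - 1;
        for (int i = 0; i < size; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        for (int k = 0; k < excIndex.length; k++) {
            out[excIndex[k]] = excValue[k];
        }
    }
}
```

A real implementation would pick the bit width per block from the data and would specialize the unpack loop per width; this sketch only shows why decoding has no per-value byte-boundary branching the way VInt does.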
This looks very interesting, thanks for the link Marvin!
Once we have pluggable codec/containers this should be a drop-in to
Lucene. Though, it does put a constraint on how we design this in
that blocks of docIDs are decoded at a time vs single-doc-at-a-time
that SegmentTermDocs/Positions iteration now exposes.
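One way to picture the constraint: a block decoder can still sit behind a one-doc-at-a-time interface if the iterator buffers a decoded block and refills it when exhausted. The sketch below assumes a hypothetical BlockDecoder interface and a BlockedDocs iterator whose next()/doc() methods only loosely mirror the style of TermDocs; none of these names are existing Lucene classes.

```java
// Hypothetical illustration class, not part of Lucene.
final class BlockedDocs {

    // Assumed source of decoded blocks: fills `deltas`, returns how many were
    // written, and returns 0 once the postings list is exhausted.
    interface BlockDecoder {
        int decodeBlock(int[] deltas);
    }

    private final BlockDecoder decoder;
    private final int[] buffer;
    private int count;  // valid entries in buffer
    private int pos;    // next entry to consume
    private int doc;    // current docID (running sum of deltas)

    BlockedDocs(BlockDecoder decoder, int blockSize) {
        this.decoder = decoder;
        this.buffer = new int[blockSize];
    }

    // Advance to the next document, refilling the buffer a block at a time.
    boolean next() {
        if (pos == count) {
            count = decoder.decodeBlock(buffer);
            pos = 0;
            if (count == 0) {
                return false;
            }
        }
        doc += buffer[pos++];
        return true;
    }

    int doc() {
        return doc;
    }

    // Demo decoder that serves deltas from an in-memory array.
    static BlockDecoder fromArray(int[] deltas) {
        int[] cursor = {0};
        return out -> {
            int n = Math.min(out.length, deltas.length - cursor[0]);
            System.arraycopy(deltas, cursor[0], out, 0, n);
            cursor[0] += n;
            return n;
        };
    }
}
```

The per-document call is then an array read and an add, with the decoding cost paid once per block.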
It would be reasonably straightforward to integrate PForDelta if we
already had Flexible Indexing implemented. Maybe some damn-the-
spaghetti optimization junkie wants to try grafting it onto Lucene
before that, but it would be a hell of a lot easier to do it
afterwards.
Let's wait a bit :) But I'd love to get to the point where this is
possible ... I think what we need to get there is to factor out the
lowest level code that reads & writes the postings so that it invokes
separate container / codec classes to do the actual work. This is
also necessary to expose other use cases we've talked about in the
past, such as storing skip data and positions in separate files, or
inlined together, or some combination.
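As a rough illustration of that factoring, the sketch below puts the byte-level format behind a small interface so the calling code does not know which encoding is in use. DocDeltaCodec, VIntCodec, FixedIntCodec and roundTrip are hypothetical names invented for this example; they are not a proposal for the actual Lucene API, and a real design would also have to cover freqs, positions, skip data and file layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration class, not part of Lucene.
final class CodecSketch {

    // The only thing the postings reader/writer needs to know about the format.
    interface DocDeltaCodec {
        byte[] encode(int[] deltas);
        int[] decode(byte[] data, int count);
    }

    // Variable-length bytes, 7 data bits each.
    static final class VIntCodec implements DocDeltaCodec {
        public byte[] encode(int[] deltas) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int v : deltas) {
                while ((v & ~0x7F) != 0) {
                    out.write((v & 0x7F) | 0x80);
                    v >>>= 7;
                }
                out.write(v);
            }
            return out.toByteArray();
        }

        public int[] decode(byte[] data, int count) {
            int[] deltas = new int[count];
            int pos = 0;
            for (int i = 0; i < count; i++) {
                int b = data[pos++];
                int v = b & 0x7F;
                for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                    b = data[pos++];
                    v |= (b & 0x7F) << shift;
                }
                deltas[i] = v;
            }
            return deltas;
        }
    }

    // Plain 4-byte ints; stands in for any alternative block format.
    static final class FixedIntCodec implements DocDeltaCodec {
        public byte[] encode(int[] deltas) {
            ByteBuffer buf = ByteBuffer.allocate(4 * deltas.length);
            for (int v : deltas) {
                buf.putInt(v);
            }
            return buf.array();
        }

        public int[] decode(byte[] data, int count) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            int[] deltas = new int[count];
            for (int i = 0; i < count; i++) {
                deltas[i] = buf.getInt();
            }
            return deltas;
        }
    }

    // Format-agnostic caller: delta-encode, hand off to the codec, rebuild docIDs.
    static int[] roundTrip(DocDeltaCodec codec, int[] docs) {
        int[] deltas = new int[docs.length];
        int last = 0;
        for (int i = 0; i < docs.length; i++) {
            deltas[i] = docs[i] - last;
            last = docs[i];
        }
        int[] decoded = codec.decode(codec.encode(deltas), docs.length);
        int doc = 0;
        for (int i = 0; i < decoded.length; i++) {
            doc += decoded[i];
            decoded[i] = doc;
        }
        return decoded;
    }
}
```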
Mike