Marvin Humphrey wrote:

On Jul 17, 2008, at 1:57 PM, eks dev wrote:

Is there any solution to have pure postings lists without interleaved tf ... this eats a lot of CPU for VInt decoding on dense terms (and also doubles IO) in our case.
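For readers unfamiliar with the cost being described, here is a minimal sketch of VInt-coded postings with a tf interleaved after every doc delta. The VInt rule (7 data bits per byte, low-order bits first, high bit as a continuation flag) is the standard one; the function names and the "always write tf as its own VInt" layout are simplifying assumptions for illustration, and Lucene's real on-disk layout differs in detail (for example in how a tf of 1 is represented).

```python
def write_vint(value, out):
    """Append value as a VInt: 7 data bits per byte, low-order bits first.
    The high bit is set on every byte except the last one."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Decode one VInt starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if (b & 0x80) == 0:  # continuation bit clear: last byte
            return result, pos
        shift += 7


def encode_postings(postings):
    """postings: list of (doc_id, tf) pairs sorted by doc_id (hypothetical layout)."""
    out = bytearray()
    last_doc = 0
    for doc, tf in postings:
        write_vint(doc - last_doc, out)  # doc delta
        write_vint(tf, out)              # interleaved term frequency
        last_doc = doc
    return bytes(out)


def decode_postings(buf):
    """Inverse of encode_postings. The tf must be decoded for every doc,
    even when the caller only wants doc ids."""
    pos = 0
    doc = 0
    result = []
    while pos < len(buf):
        delta, pos = read_vint(buf, pos)
        tf, pos = read_vint(buf, pos)
        doc += delta
        result.append((doc, tf))
    return result
```

For a dense term most deltas and most tf values fit in one byte each, so in this sketch roughly half of the bytes read and half of the VInt decodes are spent on tf.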

To decompress integers really quickly, we shouldn't even be using VInts. We should be using PForDelta, as described in <http://www2008.org/papers/pdf/p387-zhangA.pdf>. Encoding postings data with PForDelta is one of those rare opportunities to speed up searching globally.
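To make the idea concrete, here is a rough sketch of the general shape of such a scheme: doc ids become deltas, each block of deltas is bit-packed at a fixed width, and the few values that do not fit are kept aside as exceptions and patched in after unpacking. This is a deliberately simplified variant, not the exact layout from the paper; the function names are hypothetical, and a Python big integer stands in for the packed byte buffer.

```python
from itertools import accumulate


def to_deltas(doc_ids):
    """Convert sorted doc ids to gaps between consecutive ids."""
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]


def from_deltas(deltas):
    """Prefix-sum the gaps back into doc ids."""
    return list(accumulate(deltas))


def pack_block(values, bits):
    """Pack each value into a `bits`-wide slot of one integer.
    Values too large for a slot are recorded as (index, value) exceptions
    and their slot is left as zero."""
    mask = (1 << bits) - 1
    packed = 0
    exceptions = []
    for i, v in enumerate(values):
        if v > mask:
            exceptions.append((i, v))
            v = 0
        packed |= v << (i * bits)
    return packed, exceptions


def unpack_block(packed, exceptions, bits, count):
    """Unpack `count` fixed-width slots, then patch in the exceptions.
    The first pass has no per-value branching, unlike VInt decoding."""
    mask = (1 << bits) - 1
    values = [(packed >> (i * bits)) & mask for i in range(count)]
    for i, v in exceptions:
        values[i] = v
    return values
```

A usage example under the same assumptions:

```python
doc_ids = [5, 8, 10, 11, 511, 513, 514, 521]
deltas = to_deltas(doc_ids)                  # [5, 3, 2, 1, 500, 2, 1, 7]
packed, exceptions = pack_block(deltas, 3)   # 500 does not fit in 3 bits
restored = from_deltas(unpack_block(packed, exceptions, 3, len(deltas)))
assert restored == doc_ids
```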

This looks very interesting, thanks for the link Marvin!

Once we have pluggable codecs/containers, this should be a drop-in to Lucene. It does, though, put a constraint on how we design this: PForDelta decodes blocks of docIDs at a time, whereas the SegmentTermDocs/Positions iteration currently exposes a single doc at a time.
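One way to picture bridging the two models is a small adapter that buffers one decoded block and hands out docs one at a time. The sketch below is only an illustration: the class name is hypothetical, the `next()`/`doc()` method names merely echo the doc-at-a-time iteration style, and the blocks are assumed to arrive already decoded as lists of doc ids.

```python
class BlockBufferedDocs:
    """Expose doc-at-a-time iteration over a source that yields whole blocks."""

    def __init__(self, blocks):
        self._blocks = iter(blocks)  # each item: a list of decoded doc ids
        self._buffer = []
        self._pos = -1

    def next(self):
        """Advance to the next doc; return False when all blocks are exhausted."""
        self._pos += 1
        while self._pos >= len(self._buffer):
            block = next(self._blocks, None)  # fetch (decode) the next block
            if block is None:
                return False
            self._buffer = block
            self._pos = 0
        return True

    def doc(self):
        """Current doc id; only valid after next() has returned True."""
        return self._buffer[self._pos]
```

The adapter keeps existing callers working, but a consumer that can accept a whole block at once would avoid the per-doc method call, which is where the design constraint comes from.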

It would be reasonably straightforward to integrate PForDelta if we already had Flexible Indexing implemented. Maybe some damn-the-spaghetti optimization junkie wants to try grafting it onto Lucene before that, but it would be a hell of a lot easier to do it afterwards.

Let's wait a bit :) But I'd love to get to the point where this is possible ... I think what we need to get there is to factor out the lowest level code that reads & writes the postings so that it invokes separate container / codec classes to do the actual work. This is also necessary to expose other use cases we've talked about in the past, such as storing skip data and positions in separate files, or inlined together, or some combination.
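As a rough illustration of that factoring, the sketch below separates the code that walks postings from the code that turns integers into bytes. Everything here is hypothetical (the `PostingsCodec` interface, both codec classes, and the `write_postings`/`read_postings` helpers); it only shows the shape of the split, not a proposed Lucene API.

```python
import struct
from abc import ABC, abstractmethod


class PostingsCodec(ABC):
    """Hypothetical codec interface: integers in, bytes out, and back."""

    @abstractmethod
    def encode(self, deltas): ...

    @abstractmethod
    def decode(self, data): ...


class VIntCodec(PostingsCodec):
    """Variable-length bytes: 7 data bits per byte, high bit = continuation."""

    def encode(self, deltas):
        out = bytearray()
        for v in deltas:
            while v >= 0x80:
                out.append((v & 0x7F) | 0x80)
                v >>= 7
            out.append(v)
        return bytes(out)

    def decode(self, data):
        values, cur, shift = [], 0, 0
        for b in data:
            cur |= (b & 0x7F) << shift
            if b & 0x80:
                shift += 7
            else:
                values.append(cur)
                cur, shift = 0, 0
        return values


class FixedInt32Codec(PostingsCodec):
    """Fixed-width little-endian 32-bit integers (a stand-in for a block codec)."""

    def encode(self, deltas):
        return struct.pack(f"<{len(deltas)}I", *deltas)

    def decode(self, data):
        return list(struct.unpack(f"<{len(data) // 4}I", data))


def write_postings(doc_ids, codec):
    """The writer only computes deltas; the byte layout is the codec's job."""
    deltas = [d - p for p, d in zip([0] + doc_ids[:-1], doc_ids)]
    return codec.encode(deltas)


def read_postings(data, codec):
    """The reader only prefix-sums whatever the codec decodes."""
    doc_ids, doc = [], 0
    for delta in codec.decode(data):
        doc += delta
        doc_ids.append(doc)
    return doc_ids
```

With this kind of split, swapping the encoding is a matter of passing a different codec object, and the same seam is where decisions such as separate or inlined files could live.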

Mike
