Re: Index without tf, anyone?

Michael McCandless Fri, 18 Jul 2008 03:03:58 -0700


Marvin Humphrey wrote:

On Jul 17, 2008, at 1:57 PM, eks dev wrote:
is there any solution to have pure postings lists without interleaved tf ... this eats a lot of CPU for VInt decoding on dense terms (also doubles IO...) in our case.
To decompress integers really quickly, we shouldn't even be using VInts. We should be using PForDelta as described in <http://www2008.org/papers/pdf/p387-zhangA.pdf>. Encoding postings data with PForDelta is one of those rare opportunities to speed up searching globally.
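
For readers following along, here is a rough sketch of the cost eks dev describes. It assumes a simplified layout in which each posting is a VInt doc-ID delta followed by a VInt tf; this is not Lucene's exact on-disk format, and the function names (encode_postings, decode_doc_ids) are made up for illustration. The VInt scheme itself (7 data bits per byte, high bit set when more bytes follow) is the usual one.

```python
def write_vint(value, out):
    """Append a non-negative int as a VInt: 7 data bits per byte, high bit = more bytes follow."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Decode one VInt starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def encode_postings(postings, with_tf=True):
    """Encode (doc_id, tf) pairs as doc-ID deltas, optionally interleaved with tf (hypothetical layout)."""
    out = bytearray()
    last_doc = 0
    for doc_id, tf in postings:
        write_vint(doc_id - last_doc, out)
        if with_tf:
            write_vint(tf, out)
        last_doc = doc_id
    return bytes(out)


def decode_doc_ids(buf, with_tf=True):
    """Recover doc IDs only. With interleaved tf, every tf must still be decoded to be skipped."""
    doc_ids = []
    pos = 0
    doc = 0
    while pos < len(buf):
        delta, pos = read_vint(buf, pos)
        doc += delta
        if with_tf:
            _, pos = read_vint(buf, pos)  # decoded and thrown away
        doc_ids.append(doc)
    return doc_ids
```

On a dense term, where both the deltas and the tf values fit in one byte each, the interleaved stream in this sketch is twice as long as the docs-only stream, and a reader that only wants doc IDs still pays for decoding every tf.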


This looks very interesting, thanks for the link Marvin!

Once we have pluggable codec/containers this should be a drop-in to Lucene. Though, it does put a constraint on how we design this in that blocks of docIDs are decoded at a time vs single-doc-at-a-time that SegmentTermDocs/Positions iteration now exposes.
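
One way to picture that constraint: the decoder fills a buffer one block at a time, and a thin adapter still hands out one doc at a time. The sketch below assumes plain fixed-width bit packing per block, which is a simplification and not the PForDelta of the paper (there is no exception handling for outliers). The class BlockTermDocs and its next()/doc() methods are hypothetical, only loosely modeled on doc-at-a-time iteration.

```python
BLOCK_SIZE = 128  # assumed block length, chosen only for illustration


def pack_block(deltas):
    """Pack a non-empty block of deltas at one fixed bit width; return (width, count, packed_int)."""
    width = max(1, max(deltas).bit_length())
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * width)
    return width, len(deltas), packed


def unpack_block(width, count, packed):
    """Decode a whole block in one pass."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def encode(doc_ids, block_size=BLOCK_SIZE):
    """Turn sorted doc IDs into a list of packed delta blocks."""
    deltas = [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]
    return [pack_block(deltas[i:i + block_size])
            for i in range(0, len(deltas), block_size)]


class BlockTermDocs:
    """Hypothetical adapter: decodes a block at a time, exposes one doc at a time."""

    def __init__(self, blocks):
        self._blocks = iter(blocks)
        self._buffer = []  # deltas of the currently decoded block
        self._idx = 0
        self._doc = 0

    def next(self):
        """Advance to the next doc; return False when the postings are exhausted."""
        if self._idx == len(self._buffer):
            block = next(self._blocks, None)
            if block is None:
                return False
            self._buffer = unpack_block(*block)  # whole block decoded here
            self._idx = 0
        self._doc += self._buffer[self._idx]
        self._idx += 1
        return True

    def doc(self):
        return self._doc
```

The caller keeps its doc-at-a-time loop (while td.next(): use td.doc()), while all of the decoding work happens in bulk inside unpack_block.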

It would be reasonably straightforward to integrate PForDelta if we already had Flexible Indexing implemented. Maybe some damn-the-spaghetti optimization junkie wants to try grafting it onto Lucene before that, but it would be a hell of a lot easier to do it afterwards.

Let's wait a bit :) But I'd love to get to the point where this is possible ... I think what we need to get there is to factor out the lowest level code that reads & writes the postings so that it invokes separate container / codec classes to do the actual work. This is also necessary to expose other use cases we've talked about in the past, such as storing skip data and positions in separate files, or inlined together, or some combination.
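
A minimal sketch of that factoring, with every name invented for illustration (PostingsCodec, DocsAndFreqsCodec, DocsOnlyCodec, PostingsStore are not Lucene classes): the store never touches bytes itself and simply delegates to whichever codec it was given. Fixed 4-byte integers are used only to keep the example short.

```python
import struct
from abc import ABC, abstractmethod


class PostingsCodec(ABC):
    """Hypothetical codec interface: owns the byte layout of one postings list."""

    @abstractmethod
    def write(self, postings):
        """Encode a list of (doc_id, tf) pairs to bytes."""

    @abstractmethod
    def read(self, data):
        """Decode bytes back to a list of (doc_id, tf) pairs."""


class DocsAndFreqsCodec(PostingsCodec):
    """Stores doc ID and tf interleaved, 4 bytes each."""

    def write(self, postings):
        return b"".join(struct.pack("<II", doc, tf) for doc, tf in postings)

    def read(self, data):
        return [struct.unpack_from("<II", data, off)
                for off in range(0, len(data), 8)]


class DocsOnlyCodec(PostingsCodec):
    """Stores doc IDs only; tf is dropped on write and reported as 1 on read."""

    def write(self, postings):
        return b"".join(struct.pack("<I", doc) for doc, _ in postings)

    def read(self, data):
        return [(struct.unpack_from("<I", data, off)[0], 1)
                for off in range(0, len(data), 4)]


class PostingsStore:
    """The 'lowest level' reader/writer: delegates all encoding to the codec."""

    def __init__(self, codec):
        self._codec = codec
        self._terms = {}

    def add(self, term, postings):
        self._terms[term] = self._codec.write(postings)

    def postings(self, term):
        return self._codec.read(self._terms[term])

    def size_in_bytes(self):
        return sum(len(data) for data in self._terms.values())
```

Switching a field to a tf-free index is then a matter of passing DocsOnlyCodec() instead of DocsAndFreqsCodec() to PostingsStore; a block-based codec could be slotted in the same way without changing the store.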


Mike

