On Tue, Sep 17, 2013 at 1:24 PM, Smiley, David W. <[email protected]> wrote:
> Lucene has got some new compressed DocIdSet implementations that are
> technically very interesting and exciting: PForDeltaDocIdSet, WAH8DocIdSet,
> EliasFanoDocIdSet, … any more? Yet it's difficult (at least for me) to
> understand their pros/cons to know when to pick amongst them. They all seem
> great yet why do we have 3? Only one is actually used by Lucene itself --
> WAH8DocIdSet in CachingWrapperFilter. Javadocs are hit & miss; the JIRA
> issues have lots of fascinating background but it's time consuming to
> distill. I think it would be very useful to summarily document key
> characteristics on class level javadocs -- not so much implementation details
> but information to help a user choose it versus another. And as a bonus a
> table perhaps showing relative performance characteristics in package-level
> javadocs.
>
> Related to this is, I'm wondering does it make sense for a codec's postings
> (assuming no doc freq & no positions?) to be implemented as a serialized
> version of one of these compressed doc id sets? I think it would be really
> great, not just for compression but also because it might support
> Terms.advance() since some of these compressed formats have indexes.

I think it makes sense; there's an issue for it: LUCENE-5052.

Also, LUCENE-5123 (invert the PostingsFormat writing APIs) should make
it easier, since you can iterate the postings for each term more than
once, e.g. to decide in the first pass whether to encode using a
bitset or not ...

Mike McCandless

http://blog.mikemccandless.com
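
As a rough illustration of the two-pass idea above (look at a term's postings once to decide whether a bitset is worthwhile, then iterate again to encode), here is a minimal sketch. It is plain Java and does not use Lucene's PostingsFormat API; the class name, the Encoded holder, and the density threshold are all hypothetical, chosen only for illustration. The threshold assumes the only alternative to a bitset is one uncompressed 32-bit int per posting, which real compressed formats would beat.

```java
import java.util.BitSet;

public class TwoPassPostingsSketch {

    /** Hypothetical result holder: exactly one of the two fields is non-null. */
    static final class Encoded {
        final BitSet bits;   // dense representation: one bit per document
        final int[] deltas;  // sparse representation: gaps between consecutive doc IDs

        Encoded(BitSet bits, int[] deltas) {
            this.bits = bits;
            this.deltas = deltas;
        }

        boolean isBitSet() {
            return bits != null;
        }
    }

    /**
     * Encodes one term's postings in two passes.
     *
     * @param docIds doc IDs for the term, sorted ascending, no duplicates
     * @param maxDoc number of documents in the (hypothetical) segment
     */
    static Encoded encode(int[] docIds, int maxDoc) {
        // Pass 1: count the postings. With an array this is just the length,
        // but a postings iterator would have to be consumed to learn it.
        int count = 0;
        for (int doc : docIds) {
            count++;
        }

        // Assumed rule: a bitset costs maxDoc bits, a plain int list costs
        // 32 bits per posting, so prefer the bitset once count * 32 >= maxDoc.
        boolean dense = (long) count * 32 >= maxDoc;

        // Pass 2: iterate the same postings again and encode them.
        if (dense) {
            BitSet bits = new BitSet(maxDoc);
            for (int doc : docIds) {
                bits.set(doc);
            }
            return new Encoded(bits, null);
        }

        int[] deltas = new int[count];
        int previous = 0;
        for (int i = 0; i < count; i++) {
            deltas[i] = docIds[i] - previous; // first delta is the doc ID itself
            previous = docIds[i];
        }
        return new Encoded(null, deltas);
    }
}
```

The point of the sketch is only that the choice of encoding depends on information (here, the posting count) that is not known until the postings have been seen once, which is why being able to iterate them a second time helps.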
