Up to Lucene 4.4, CachingWrapperFilter cached filters with FixedBitSet only by default, which sounds wasteful when the filter only matches a few documents so I wanted to start experimenting with new alternatives. I first worked on WAH8DocIdSet, then Paul Elschot contributed EliasFanoDocIdSet and PForDeltaDocIdSet is inspired from Kamikaze, a library which was providing compressed in-memory doc id sets for a long time.
You can find a rough compression/performance comparison of these DocIdSets at http://people.apache.org/~jpountz/doc_id_sets.html. EliasFanoDocIdSet shows the best compression ratio on small sets and an interesting decompression speed but I would wait for it to have an index to use it[1], so that advance() is not linear of the number of documents to skip (you can see on the last chart that it makes this DocIdSet ~5000x slower than FixedBitSet and ~64 times slower than the other compressed set implementations to skip over a very large number of documents. Moreover it doesn't have mechanisms in order not to be larger than a FixedBitSet on dense sets like WAH8 and PForDelta. PForDeltaDocIdSet is a modified implementation of the pfor-delta encoding to use unary encoding (like FixedBitSet) on blocs which are dense. WAH8DocIdSet is conceptually like a FixedBitSet where sequences which are full of zeros or ones are encoding using a VInt. This allows it to never be larger than a FixedBitSet (when the set is dense, the encoding is actually the same as FixedBitSet) and to skip efficiently when the set is dense (like FixedBitSet). I think PForDeltaDocIdSet, WAH8DocIdSet and FixedBitSet would all be good default implementations for caching in CachingWrapperFilter, but WAH8DocIdSet has the advantage of compressing sparse and very dense sets efficiently compared to FixedBitSet and has a better worst-case performance compared to PForDeltaDocIdSet (eg. on advance(docID + 31), WAH8DocIdSet is 4.3x slower than FixedBitSet in the worst case while PForDeltaDocIdSet is 13.8x slower than FixedBitSet in the worst case). If you have a very high cache churn on your CachingWrapperFilter however, it could make sense to use PForDeltaDocIdSet instead since it is ~2x faster to build when the density of the set is <1%. If at some point we find that one of these DocIdSets is less interesting than other ones, I'm perfectly fine with removing it or moving it to lucene-misc. About postings formats, this is indeed the same problem as postings lists encoding and hopefully these experiments will help build an even better postings format. LUCENE-5123 is very exciting regarding this because it will allow to know the exact density of the postings list before starting encoding it, allowing to choose the optimal encoding based on the density. -- Adrien --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
