Re: Documentation on the new compressed DocIdSet implementations

Adrien Grand Wed, 18 Sep 2013 01:10:25 -0700

Up to Lucene 4.4, CachingWrapperFilter cached filters with FixedBitSet
only by default, which sounds wasteful when the filter only matches a
few documents so I wanted to start experimenting with new
alternatives. I first worked on WAH8DocIdSet, then Paul Elschot
contributed EliasFanoDocIdSet and PForDeltaDocIdSet is inspired from
Kamikaze, a library which was providing compressed in-memory doc id
sets for a long time.


You can find a rough compression/performance comparison of these
DocIdSets at http://people.apache.org/~jpountz/doc_id_sets.html.

EliasFanoDocIdSet shows the best compression ratio on small sets and
an interesting decompression speed but I would wait for it to have an
index to use it[1], so that advance() is not linear of the number of
documents to skip (you can see on the last chart that it makes this
DocIdSet ~5000x slower than FixedBitSet and ~64 times slower than the
other compressed set implementations to skip over a very large number
of documents. Moreover it doesn't have mechanisms in order not to be
larger than a FixedBitSet on dense sets like WAH8 and PForDelta.

PForDeltaDocIdSet is a modified implementation of the pfor-delta
encoding to use unary encoding (like FixedBitSet) on blocs which are
dense.

WAH8DocIdSet is conceptually like a FixedBitSet where sequences which
are full of zeros or ones are encoding using a VInt. This allows it to
never be larger than a FixedBitSet (when the set is dense, the
encoding is actually the same as FixedBitSet) and to skip efficiently
when the set is dense (like FixedBitSet).

I think PForDeltaDocIdSet, WAH8DocIdSet and FixedBitSet would all be
good default implementations for caching in CachingWrapperFilter, but
WAH8DocIdSet has the advantage of compressing sparse and very dense
sets efficiently compared to FixedBitSet and has a better worst-case
performance compared to PForDeltaDocIdSet (eg. on advance(docID + 31),
WAH8DocIdSet is 4.3x slower than FixedBitSet in the worst case while
PForDeltaDocIdSet is 13.8x slower than FixedBitSet in the worst case).
If you have a very high cache churn on your CachingWrapperFilter
however, it could make sense to use PForDeltaDocIdSet instead since it
is ~2x faster to build when the density of the set is <1%.

If at some point we find that one of these DocIdSets is less
interesting than other ones, I'm perfectly fine with removing it or
moving it to lucene-misc.

About postings formats, this is indeed the same problem as postings
lists encoding and hopefully these experiments will help build an even
better postings format. LUCENE-5123 is very exciting regarding this
because it will allow to know the exact density of the postings list
before starting encoding it, allowing to choose the optimal encoding
based on the density.

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Documentation on the new compressed DocIdSet implementations

Reply via email to