On 7/26/06 10:14 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:

> Mark Kirkwood <[EMAIL PROTECTED]> writes:
>> An obvious deduction is that the TPCH dataset is much more amenable to
>> run compression than my synthetic Zipfian data was. The interesting
>> question is how well "real" datasets are run compressible,
> Yeah --- the back-of-the-envelope calculations I was making presupposed
> uniform random distribution, and we know that's often not realistic for
> real datasets.  A nonuniform distribution would probably mean that some
> of the bitmaps compress better than expected and others worse.  I have
> no idea how to model that and guess what the overall result is ...

The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
Wu et al gave an approximate answer for this question. Assume that there are
c distinct values. Let the i-th value has a probability of p_i, the number
of rows r, and the word size w. then the total size of the compressed bitmap
index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
both \sum's, i is from 1 to c.

The constraint on this equation is \sum(p_i) = 1. Since
(1-p)^(2w-2) + p^(2w-2) is convex in p, the two sums are minimized, and
hence the index size maximized, when all p_i are equal, i.e. when the
attribute's values are uniformly distributed. A quick numeric check of this
is sketched below.
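
For a rough sanity check, here is a small Python sketch of the estimate.
The row count, word size, and Zipfian exponent below are my own arbitrary
choices for illustration, not numbers from the paper or this thread:

# Approximate words used by the compressed bitmap index per Wu et al.:
# (N/(w-1)) * (c - sum((1-p_i)^(2w-2)) - sum(p_i^(2w-2)))
def bitmap_index_size_words(N, w, probs):
    k = 2 * w - 2
    c = len(probs)
    return (N / (w - 1)) * (c - sum((1 - p) ** k for p in probs)
                              - sum(p ** k for p in probs))

N, w, c = 1_000_000, 32, 100   # rows, word size in bits, distinct values

# Uniform distribution: every value equally likely.
uniform = [1.0 / c] * c

# Zipfian distribution with exponent 1, normalized so the p_i sum to 1.
weights = [1.0 / i for i in range(1, c + 1)]
total = sum(weights)
zipfian = [x / total for x in weights]

print("uniform:", int(bitmap_index_size_words(N, w, uniform)))
print("zipfian:", int(bitmap_index_size_words(N, w, zipfian)))

With these particular numbers the uniform case comes out around 1.5M
words and the Zipfian case around 1.0M, consistent with the uniform
distribution being the worst case for the index size.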
