Re: [HACKERS] On-disk bitmap index patch


On 7/26/06 11:50 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:

> "Jie Zhang" <[EMAIL PROTECTED]> writes:
>> On 7/26/06 10:14 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
>>> ... A nonuniform distribution would probably mean that some
>>> of the bitmaps compress better-than-expected and others worse.  I have
>>> no idea how to model that and guess what the overall result is ...
>
>> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
>> Wu et al gave an approximate answer for this question. Assume that there are
>> c distinct values. Let the i-th value has a probability of p_i, the number
>> of rows r, and the word size w. then the total size of the compressed bitmap
>> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
>> both \sum's, i is from 1 to c.
>
> Hm, but that's still begging the question no?  It's still assuming that
> any one value is uniformly distributed.  ISTM the cases that would break
> my simplistic calculation involve clustering of particular values, such
> that some areas of the bitmap are all-zero while other areas have lots
> of ones.

Yes, you are right -- each value is still uniformly distributed. But this
will be the worst case in terms of the size of a bitmap vector. As for how
to model the size of a bitmap vector for an non-uniformly distributed value,
that's a good question. I don't really know. But we do know the best case
and the worse case.