# Re: [HACKERS] On-disk bitmap index patch

On Thu, Jul 27, 2006 at 09:13:21AM -0700, Jie Zhang wrote:
> On 7/26/06 11:50 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> > "Jie Zhang" <[EMAIL PROTECTED]> writes:
> >> On 7/26/06 10:14 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> >>> ... A nonuniform distribution would probably mean that some
> >>> of the bitmaps compress better-than-expected and others worse.  I have
> >>> no idea how to model that and guess what the overall result is ...
> >
> >> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> >> Wu et al gave an approximate answer for this question. Assume that there
> >> are
> >> c distinct values. Let the i-th value has a probability of p_i, the number
> >> of rows r, and the word size w. then the total size of the compressed
> >> bitmap
> >> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where
> >> in
> >> both \sum's, i is from 1 to c.
> >
> > Hm, but that's still begging the question no?  It's still assuming that
> > any one value is uniformly distributed.  ISTM the cases that would break
> > my simplistic calculation involve clustering of particular values, such
> > that some areas of the bitmap are all-zero while other areas have lots
> > of ones.
>
> Yes, you are right -- each value is still uniformly distributed. But this
> will be the worst case in terms of the size of a bitmap vector. As for how
> to model the size of a bitmap vector for an non-uniformly distributed value,
> that's a good question. I don't really know. But we do know the best case
> and the worse case.

If the usefulness of bitmap indexes is still in doubt, could someone at
Greenplum provide data from actual data warehouses from actual
customers?
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461