On 7/26/06 11:50 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> "Jie Zhang" <[EMAIL PROTECTED]> writes: >> On 7/26/06 10:14 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote: >>> ... A nonuniform distribution would probably mean that some >>> of the bitmaps compress better-than-expected and others worse. I have >>> no idea how to model that and guess what the overall result is ... > >> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng >> Wu et al gave an approximate answer for this question. Assume that there are >> c distinct values. Let the i-th value has a probability of p_i, the number >> of rows r, and the word size w. then the total size of the compressed bitmap >> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in >> both \sum's, i is from 1 to c. > > Hm, but that's still begging the question no? It's still assuming that > any one value is uniformly distributed. ISTM the cases that would break > my simplistic calculation involve clustering of particular values, such > that some areas of the bitmap are all-zero while other areas have lots > of ones. Yes, you are right -- each value is still uniformly distributed. But this will be the worst case in terms of the size of a bitmap vector. As for how to model the size of a bitmap vector for an non-uniformly distributed value, that's a good question. I don't really know. But we do know the best case and the worse case. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq