Hi, On Sat, Oct 05, 2013 at 08:22:54PM +0200, Tomas Vondra wrote: > I'm on 64-bit architecture and the example works with int32, which means > the sizes should be about this: > > hash_element_t => 20B > hash_bucket_t => 4B + (20B * items in the bucket [in steps of 5]) > hash_table_t => 4B + space for buckets > > In the example above, there's ~20k unique values in each group. The > threshold is 20 items per bucket on average, so that's 1024 buckets, and > the buckets are almost full. > > So for single group, the hash table size is about > > 4B + 1024 * (4B + 20 * 20B) = 413700B = ~ 400 kB > > There are 4000 groups, so the total estimate is ~1.6GB in total. > > However when executed (9.2, 9.3 and HEAD behave exactly the same), the > query consumes almost ~5GB of RAM (excluding shared buffers).
I think the missing thing is the memory allocator bookkeeping overhead. You're assuming that hash_element_t.value takes 8B for the pointer and 4B for the value itself, but using malloc it takes another at least 20 bytes, and from a quick glance at backend/utils/mmgr/aset.c it seems that palloc is certainly not without its overhead either. Also, each additional level of pointers adds execution overhead and increases the likelihood of cache misses. I'd suggest a few improvements, if I may: 1. Drop hash_element_t altogether, store length in hash_bucket_t and alloc hash_bucket_t.items of size nitems * length bytes. I doubt that storing the hash values has a benefit worth the storage and code complexity overhead (you're storing fixed-size ints, not large blobs that are expensive to compare and hash). 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2]. For fun, try not hashing those ints at all and see how that performs (that, I think, is what you get from HashSet<int> in Java/C#). 3. Consider dropping buckets in favor of open addressing (linear probing, quadratic, whatever). This avoids another level of pointer indirection. It's been a few years since I've done stuff this low level, so I won't go into suggesting a different data structure -- I have honestly no idea what's the best way to count the number of distinct integers in a list. [1] http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash [2] http://en.wikipedia.org/wiki/Jenkins_hash_function Best regards, -- Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers