We are trying to sort the index tuples before inserting them into hash buckets, 
to improve build speed.  

Here is our plan:

1. Build a spool that contains all the index tuples to be inserted into the 
buckets. - this is done.

2. sort the index tuples in the spool according to the bucket number to which 
they should belong. This results in accessing a bucket once and only once.  

3. For (2) to work, we need an estimate of the number of buckets. This is done.

4. After sorting the index tuples, insert them into hash in bucket order. 

Our challenge: we need to determine the final bucket number for the itup (index 

1. to do the above, we need to apply a mask to the hash value of the index 
tuple. first, we calculate the hash value of the index tuple. then, we 
calculate the mask using:

(1 << (ceiling(log 2 (Estimate of buckets needed))))-1

So, if we need 6 buckets, the mask would be 7 or binary 111.  If we needed 100 
buckets, the mask would be 127 or binary 1111111.   If we AND this mask to the 
hash of the key, we only recognize the least   sig. bits needed to do the 

A 32 bit hash value may look like:  10110101001010101000010101010101

Let's say we just need 6 buckets, apply the mask 111 and we get:

10110101001010101000010101010101 (the hash value of the key)  
00000000000000000000000000000111 (the mask &)
00000000000000000000000000000101 (the resulting bucket number = 5)

If we needed 100 buckets, the calculation would look like:

10110101001010101000010101010101 (the hash value of the key)   
00000000000000000000000001111111 (the mask &)
00000000000000000000000001010101 (the resulting bucket number = 85) 

2. however, in practice when we apply a mask of value say, 1111(binary) our 
resulting bucket number is not evenly distrubuted.

3. do we look for a better hash function? or can we modify the existing hash?  
Comments are welcome.


---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at


Reply via email to