# [HACKERS] Building Hash Index by Presorting Tuples

Hi,

We are trying to sort the index tuples before inserting them into hash buckets,
to improve build speed.

Here is our plan:

1. Build a spool that contains all the index tuples to be inserted into the
buckets. This is done.

2. Sort the index tuples in the spool by the number of the bucket to which
they belong, so that each bucket is accessed once and only once.

3. For (2) to work, we need an estimate of the number of buckets. This is done.

4. After sorting the index tuples, insert them into the hash index in bucket
order.
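
Steps 2 and 4 can be sketched roughly as follows. This is only an illustration, not the actual patch: SpoolEntry, compare_by_bucket and sort_spool_by_bucket are hypothetical names, the spool is modelled as a plain array, and the bucket number is assumed to be the hash value ANDed with a mask.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a spooled index tuple; only the hash matters here. */
typedef struct SpoolEntry
{
    uint32_t hash;      /* hash value of the index key */
    uint32_t bucket;    /* assumed bucket number: hash & mask */
} SpoolEntry;

/* qsort comparator: order entries by ascending bucket number. */
int
compare_by_bucket(const void *a, const void *b)
{
    const SpoolEntry *ea = (const SpoolEntry *) a;
    const SpoolEntry *eb = (const SpoolEntry *) b;

    if (ea->bucket < eb->bucket)
        return -1;
    if (ea->bucket > eb->bucket)
        return 1;
    return 0;
}

/* Assign each entry its bucket number, then sort the spool by bucket. */
void
sort_spool_by_bucket(SpoolEntry *entries, size_t n, uint32_t mask)
{
    for (size_t i = 0; i < n; i++)
        entries[i].bucket = entries[i].hash & mask;

    qsort(entries, n, sizeof(SpoolEntry), compare_by_bucket);
}
```

After the sort, a single pass over the array sees all the tuples of bucket 0, then all of bucket 1, and so on, which is what lets each bucket be visited only once.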

Our challenge: we need to determine the final bucket number for the itup (index
tuple).

1. To do the above, we need to apply a mask to the hash value of the index
tuple. First, we calculate the hash value of the index tuple. Then we compute
the mask as:

(1 << ceiling(log2(estimated number of buckets needed))) - 1

So, if we need 6 buckets, the mask is 7, or binary 111. If we need 100
buckets, the mask is 127, or binary 1111111. If we AND this mask with the
hash of the key, we keep only the least significant bits needed for the
comparison.

A 32-bit hash value may look like: 10110101001010101000010101010101

Say we need just 6 buckets. Applying the mask 111, we get:

10110101001010101000010101010101 (the hash value of the key)
--------------------------------
00000000000000000000000000000101 (the resulting bucket number = 5)

If we needed 100 buckets, the calculation would look like:

10110101001010101000010101010101 (the hash value of the key)
--------------------------------
00000000000000000000000001010101 (the resulting bucket number = 85)
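
The mask calculation and the two examples above can be checked with a small sketch like the one below. The function name bucket_mask is made up for illustration, and the loop is just one integer-only way to evaluate (1 << ceiling(log2(n))) - 1 without floating point.

```c
#include <stdint.h>

/*
 * Return the smallest mask of the form 2^k - 1 with 2^k >= nbuckets,
 * i.e. (1 << ceiling(log2(nbuckets))) - 1.
 * A 64-bit accumulator avoids overflow when nbuckets exceeds 2^31.
 */
uint32_t
bucket_mask(uint32_t nbuckets)
{
    uint64_t size = 1;

    while (size < nbuckets)
        size <<= 1;

    return (uint32_t) (size - 1);
}
```

The example hash value is 0xB52A8555 in hex, so 0xB52A8555 & bucket_mask(6) gives 5 and 0xB52A8555 & bucket_mask(100) gives 85, matching the figures above. Note that with 6 buckets the mask 7 can also produce 6 or 7, so the estimate is effectively rounded up to the next power of two.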

2. However, in practice, when we apply a mask of, say, 1111 (binary), the
resulting bucket numbers are not evenly distributed.

3. Do we look for a better hash function, or can we modify the existing one?
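
One way to put numbers on the skew mentioned in (2) is to count how many hash values land in each bucket under a given mask. The sketch below is only a measuring aid under the same masking assumption; bucket_histogram is a hypothetical helper, not an existing function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many hash values fall into each bucket under the given mask.
 * counts must have room for mask + 1 entries.
 * Returns the size of the fullest bucket.
 */
size_t
bucket_histogram(const uint32_t *hashes, size_t n, uint32_t mask,
                 size_t *counts)
{
    size_t worst = 0;

    for (size_t b = 0; b <= (size_t) mask; b++)
        counts[b] = 0;

    for (size_t i = 0; i < n; i++)
    {
        size_t b = hashes[i] & mask;

        counts[b]++;
        if (counts[b] > worst)
            worst = counts[b];
    }

    return worst;
}
```

For instance, the hash values 0, 16, 32 and 48 differ only above the low four bits, so with mask 1111 they all fall into bucket 0. If the real hash values show a similar pattern, that would suggest the low-order bits of the hash are not well mixed, though this is a guess rather than something we have confirmed.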