Dear PostgreSQL developers,

I am re-sending this to keep this last change to the
internal hash function on the radar.

Ken
............
Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken
----------------------------------------------------
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                            old    new  newv2
----------                            ---    ---  -----
cracklib                              338    316  338
cracklib x 2 (i.e. clibclib)          305    319  300
cracklib x 3 (clibclibclib)           323    329  315
cracklib x 10                         302    310  329
cracklib x 100                        350    335  298
cracklib x 1000                       314    309  315
cracklib x 100 truncated to char(100) 311    327  320

uint32 from 1-1648379                 309    319  347
(uint32 1-1948379)*256                309    314  304
(uint32 1-1948379)*16                 310    314  324
"a"uint32 (i.e. a00001,a0002...)      320    321  312

uint32uint32 (i.e. uint64)            321    287  309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, spliting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
... (21 times)

REINDEX TABLE
...

The average time to reindex the table using our current
hash_any() without the separate mix()/final() was 1696ms
and 1482ms with the separate mix()/final() stages giving
almost 13% better performance for this stupid metric.
--- indices.sgml        2008-10-13 14:40:06.000000000 -0500
+++ indices.sgml.NEW    2008-11-04 12:42:35.000000000 -0600
@@ -190,13 +190,11 @@
 
   <note>
    <para>
-    Testing has shown <productname>PostgreSQL</productname>'s hash
-    indexes to perform no better than B-tree indexes, and the
-    index size and build time for hash indexes is much worse.
-    Furthermore, hash index operations are not presently WAL-logged,
+    <productname>PostgreSQL</productname>'s hash indexes provide
+    the fast O(1) lookups, even for very large objects.
+    Hash index operations are not presently WAL-logged,
     so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash.
-    For these reasons, hash index use is presently discouraged.
+    after a database crash. 
    </para>
   </note>
 
--- hashfunc.c.ORIG     2008-09-03 13:07:14.000000000 -0500
+++ hashfunc.c.NEW      2008-11-04 08:36:16.000000000 -0600
@@ -200,39 +200,94 @@
  * hash function, see http://burtleburtle.net/bob/hash/doobs.html,
  * or Bob's article in Dr. Dobb's Journal, Sept. 1997.
  *
- * In the current code, we have adopted an idea from Bob's 2006 update
- * of his hash function, which is to fetch the data a word at a time when
- * it is suitably aligned.  This makes for a useful speedup, at the cost
- * of having to maintain four code paths (aligned vs unaligned, and
- * little-endian vs big-endian).  Note that we have NOT adopted his newer
- * mix() function, which is faster but may sacrifice some randomness.
+ * In the current code, we have adopted Bob's 2006 update of his hash
+ * which fetches the data a word at a time when it is suitably aligned.
+ * This makes for a useful speedup, at the cost of having to maintain
+ * four code paths (aligned vs unaligned, and little-endian vs big-endian).
+ * It also two separate mixing functions mix() and final() instead
+ * of a single multi-purpose function, that is slower as a result.
  */
 
 /* Get a bit mask of the bits set in non-uint32 aligned addresses */
 #define UINT32_ALIGN_MASK (sizeof(uint32) - 1)
+#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
 
 /*----------
  * mix -- mix 3 32-bit values reversibly.
- * For every delta with one or two bits set, and the deltas of all three
- * high bits or all three low bits, whether the original value of a,b,c
- * is almost all zero or is uniformly distributed,
- * - If mix() is run forward or backward, at least 32 bits in a,b,c
- *      have at least 1/4 probability of changing.
- * - If mix() is run forward, every bit of c will change between 1/3 and
- *      2/3 of the time.  (Well, 22/100 and 78/100 for some 2-bit deltas.)
+ *
+ * This is reversible, so any information in (a,b,c) before mix() is
+ * still in (a,b,c) after mix().
+ *
+ * If four pairs of (a,b,c) inputs are run through mix(), or through
+ * mix() in reverse, there are at least 32 bits of the output that
+ * are sometimes the same for one pair and different for another pair.
+ * This was tested for:
+ * * pairs that differed by one bit, by two bits, in any combination
+ *   of top bits of (a,b,c), or in any combination of bottom bits of
+ *   (a,b,c).
+ * * "differ" is defined as +, -, ^, or ~^.  For + and -, I transformed
+ *   the output delta to a Gray code (a^(a>>1)) so a string of 1's (as
+ *   is commonly produced by subtraction) look like a single 1-bit
+ *   difference.
+ * * the base values were pseudorandom, all zero but one bit set, or
+ *   all zero plus a counter that starts at zero.
+ * 
+ * This does not achieve avalanche.  There are input bits of (a,b,c)
+ * that fail to affect some output bits of (a,b,c), especially of a.  The
+ * most thoroughly mixed value is c, but it doesn't really even achieve
+ * avalanche in c. 
+ * 
+ * This allows some parallelism.  Read-after-writes are good at doubling
+ * the number of bits affected, so the goal of mixing pulls in the opposite
+ * direction as the goal of parallelism.  I did what I could.  Rotates
+ * seem to cost as much as shifts on every machine I could lay my hands
+ * on, and rotates are much kinder to the top and bottom bits, so I used
+ * rotates.
  *----------
  */
 #define mix(a,b,c) \
 { \
-  a -= b; a -= c; a ^= ((c)>>13); \
-  b -= c; b -= a; b ^= ((a)<<8); \
-  c -= a; c -= b; c ^= ((b)>>13); \
-  a -= b; a -= c; a ^= ((c)>>12);  \
-  b -= c; b -= a; b ^= ((a)<<16); \
-  c -= a; c -= b; c ^= ((b)>>5); \
-  a -= b; a -= c; a ^= ((c)>>3);       \
-  b -= c; b -= a; b ^= ((a)<<10); \
-  c -= a; c -= b; c ^= ((b)>>15); \
+  a -= c;  a ^= rot(c, 4);  c += b; \
+  b -= a;  b ^= rot(a, 6);  a += c; \
+  c -= b;  c ^= rot(b, 8);  b += a; \
+  a -= c;  a ^= rot(c,16);  c += b; \
+  b -= a;  b ^= rot(a,19);  a += c; \
+  c -= b;  c ^= rot(b, 4);  b += a; \
+}
+
+/*----------
+ * final -- final mixing of 3 32-bit values (a,b,c) into c
+ *
+ * Pairs of (a,b,c) values differing in only a few bits will usually
+ * produce values of c that look totally different.  This was tested for
+ * * pairs that differed by one bit, by two bits, in any combination
+ *   of top bits of (a,b,c), or in any combination of bottom bits of
+ *   (a,b,c).
+ * * "differ" is defined as +, -, ^, or ~^.  For + and -, I transformed
+ *   the output delta to a Gray code (a^(a>>1)) so a string of 1's (as
+ *   is commonly produced by subtraction) look like a single 1-bit
+ *   difference.
+ * * the base values were pseudorandom, all zero but one bit set, or
+ *   all zero plus a counter that starts at zero.
+ *     
+ * The use of separate functions for mix() and final() allow for a
+ * substantial performance increase since final() does not need to
+ * do well in reverse, but is does need to affect all output bits.
+ * mix(), on the other hand, does not need to affect all output
+ * bits (affecting 32 bits is enough).The original hash function had
+ * a single mixing operation that had to satisfy both sets of requirements
+ * and was slower as a result.
+ *----------
+ */
+#define final(a,b,c) \
+{ \
+  c ^= b; c -= rot(b,14); \
+  a ^= c; a -= rot(c,11); \
+  b ^= a; b -= rot(a,25); \
+  c ^= b; c -= rot(b,16); \
+  a ^= c; a -= rot(c,4);  \
+  b ^= a; b -= rot(a,14); \
+  c ^= b; c -= rot(b,24); \
 }
 
 /*
@@ -260,8 +315,7 @@
 
        /* Set up the internal state */
        len = keylen;
-       a = b = 0x9e3779b9;                     /* the golden ratio; an 
arbitrary value */
-       c = 3923095;                            /* initialize with an arbitrary 
value */
+       a = b = c = 0x9e3779b9 + len + 3923095;
 
        /* If the source pointer is word-aligned, we use word-wide fetches */
        if (((long) k & UINT32_ALIGN_MASK) == 0)
@@ -445,7 +499,7 @@
 #endif /* WORDS_BIGENDIAN */
        }
 
-       mix(a, b, c);
+       final(a, b, c);
 
        /* report the result */
        return UInt32GetDatum(c);
@@ -465,11 +519,10 @@
                                b,
                                c;
 
-       a = 0x9e3779b9 + k;
-       b = 0x9e3779b9;
-       c = 3923095 + (uint32) sizeof(uint32);
+       a = b = c = 0x9e3779b9 + (uint32) sizeof(uint32) + 3923095;
+       a += k;
 
-       mix(a, b, c);
+       final(a, b, c);
 
        /* report the result */
        return UInt32GetDatum(c);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to