Poor hashtable collision performance

Eric Blake Sat, 17 Apr 2021 06:31:44 -0700

Consider the following program I wrote to divide a 50,000 byte string
into a stack of one character per definition (this is supposed to be an
O(n log n) task, because it repeatedly cuts the input in half, and
should be more efficient than iterating on
substr(`$1',0,1)$0(substr(`$1',1) with its O(n^2) complexity):


$ cat tmp
define(`chew', `ifelse($1, 1, `pushdef(`stack', substr(`$2',0, 1))',
  `$0(eval($1/2), substr(`$2', 0, eval($1/2))_)$0(eval($1 - $1/2),
  substr(`$2', eval($1/2)))')')dnl
chew(50000, format(`%050000d', 0))dnl
$ time m4 tmp

real    0m22.264s
user    0m22.180s
sys     0m0.010s
$ time m4 -H517 tmp

real    0m0.321s
user    0m0.316s
sys     0m0.004s

What gives?  It turns out that with the default hash size of 509,
'stack' and 'substr' hash to the same bucket, and as the pushdef depth
of stack increases, the lookup time for 'substr' gets progressively
worse.  With -H517, the two names hash to different buckets and no
longer interfere.  I'm working on a patch to separate the pushdef chain
from the hash bucket chain, so that hash collisions only have to visit
one instance of a colliding name, not the entire stack.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Poor hashtable collision performance

Reply via email to