Re: Faster HashMap implementation

Doug Lea Wed, 22 Jul 2009 09:21:48 -0700

Sorry for the delay getting back to this.

I put an update of HashMapV2 at
  http://gee.cs.oswego.edu/dl/wwwtmp/HashMapV2.java
It includes the better support for tiny maps and probe sequences that
were missing.


On further evaluating it (among other possible HashMap replacements),
I'm not convinced that they are enough better to commit.

To explain why, I'll first back up to the main observations that lead
to exploring open-address (unchained) tables.  Here are runs of JDK
HashMap (using chained) vs IdentityHashMap (IHM) (using open
addressing) on (further updated) MapMicroBenchmark on an x86 (this is
typical of other x86 and sparc runs).  Also shown is current
HashMapV2.  Times in nanosecs per element-operation, averaged across 8
key types.

     Size:   9     36    144    576   2304   9216  36864 147456 589824
HashMap     49     33     30     33     37     45     97    208    273
IHM         41     28     26     28     31     35     58    151    188
V2          47     31     29     37     36     40     78    188    257

On these (and other) tests, IdentityHashMap is from 15% to 60% faster
than HashMap.  Plus it typically has a smaller footprint, by around
40% in steady state (although for tiny tables it has a larger
footprint.)  So you'd think that by compromising a bit on the
properties that make IHM faster, it would be possible to arrive at
something close in revised HashMap.  HashMapV2 usually is in-between
on both time and space, but unfortunately much closer to HashMap,
close enough that it is probably no better than making a couple of
simple tweaks to the current HashMap (see below).

The main issues are hash quality and caching:

1. IdentityHashCodes have good randomness properties. (This is by
design: In hotspot, they are produced by a simple random number
generator.)  This means that basic linear-probe open-address table in
IHM usually runs close to its theoretical expected performance level.
To effectively deal with non-randomness of hashCode (vs
identityHashCodes) for common key types, you need to both precondition
(bit-spread) them, and give up on linear-probing and instead use
psuedo-randomized probing techniques.  (If you do no preconditioning
and use linear probing, you get factor of 20X slowdowns for some known
key type/value usages.)  The first costs computation overhead; the
second costs memory locality.  While you can trade these off a bit
(heavier preconditioning vs more locality of subsequent probes), I
don't know of pair of such techniques better than those in current V2.
And even these measures cannot achieve as much randomness as you get
in IdentityHashMap. In particular, neither help much when dealing with
the common problem of encountering much greater than (theorecticallY
expected numbers of duplicate hashCodes. In a chained table, these
poison only one bin, but in open-adresss they tend to reduce overall
performance. So all in all, the hit rates are noticeably worse.

2. Because Objects themselves are required to cache their own
identityHashCodes, there is no reason for IdentityHashMap to do so. So
there is no need for a side-array to hold them.  The overall space
reduction also makes it an easy call to use a 2/3 vs 3/4 load factor
threshold, which is a good/common value for open-address tables.  The
need for side-array of hashes in HashMapV2 not only takes up some
space, but also leads to cache misses, mainly on collisions. For large
tables, these cases noticeably hurt overall throughput. Note: One
reason for caching hashes is compatibility. For many key types (see
below) it is a better tradeoff to recompute them when needed for
rehashing and not to bother using them to pre-screen equals. But there
are exceptions, and people out there surely have code the relies on
lack of recomputation.

(In case anyone is curious, the best version I tried with heavy
randomizer, linear probe, and no side-array has benchmark stats:
     Size:   9     36    144    576   2304   9216  36864 147456 589824
average     46     36     38     44     43     49     88    235    294
which is OK, but without any room to improve to the point of being
good enough to invite a slew of performance bug reports about hash
recomputations.)

Further, the minor-looking compatibility issues with current HashMap
(like needing an entirely different map to handle overflow after
500million entries) all add some constant overhead.  Plus the
intrinsically worse performance on some entrySet operations because of
lack of Entry objects. Simple escape analysis will often but not
always come to the rescue here, so it is a relatively minor concern.
But all together these contribute to the conclusion that open-address
approach doesn't have enough advantages to replace HashMap. Similarly
for the mixed-side-array approach of Alex's Compact/Fast HashMap.

As I mentioned in previous posts, it is possible to use open
addressing to get speedups and footprint reductions comparable to IHM
for the set of key types that seem to account for up to 80% of all
usages: Strings, Integer/Long, and identity.  All three share lack of
need to cache hash codes plus either good randomization properties or
known cheap ways to attain them.  Identity is already supported, but
people don't seem to choose IdentityHashMaps all that often even when
they apply.  The Integer case is supported in a not-very-nice way in
Sun JDK releases under -XX:+AggressiveOpts.  Strings are by far the
most common case but no one has tried anything comparable.  Internal
specialization might be worth a try but I can't get too excited about
prospects for the only available means of doing this -- optimistic
type tests followed by costly de-optimization when you are wrong. This
extends the lack of performance predicatability you already get in
Java to living with higher-order algorithmic luckiness. Explicit
specialization (as is upcoming in Scala) seems like a much better
line of attack.

There is though one issue surrounding current HashMaps that does seem
to deserve a short-term minor improvement. People (including the IBM
"bloat" paper folks) have repeatedly shown that a great many HashMaps
are either permanently empty or hold at most a few elements. It would
be pretty straightforward to accommodate this by (1) lazily
initializing arrays (2) initially allocating only capacity 4, then
jumping to current default of 16 on first resize. This seems like a
simple win. Testing this out, it seems to have almost no performance
impact on non-tiny map usages. (In part because adding an initial
4->16 resize warms up the resize code.)

-Doug

Re: Faster HashMap implementation

Reply via email to