A while ago we discussed whether it'd make sense to integrate HPPC with
Mahout; I remember Ted mentioned it'd be good to have some kind of
comparative benchmark pitting Mahout Math collections against HPPC. I
finally found the time to do it, prompted by some substantial updates
Sebastiano Vigna introduced in fastutil. But first, a disclaimer :)

I am not trying to impose HPPC on anybody and I'm not trying to show which
library is "better" or "worse" in a general sense. I believe the benchmarks
presented below, and any conclusions that can be drawn from them, apply
only marginally to any algorithm complex enough to be doing something ELSE
besides stressing a given hash map implementation. I also believe that
diversity drives progress, so it's good to share ;)

So. First, I've been thinking about how to measure hash map performance in
any sensible (read: reasonable) application. Sebastiano suggested some clues
and I came up with some other ideas, but any realistic benchmarks (in the
sense of data or algorithm) would be a welcome extension. I've implemented
the following benchmarks (all on int-int maps):

- BenchmarkPut: injecting a lot of integers into a map, basically what one
would do to build a unique set of keys, for example. The input ints have
different distributions: linear, random and high-bits linear (increments on
higher bits first, then on lower bits).

- BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
fraction of them (not benchmarked), and (benchmarked) querying this map for
a partially matching set of keys. A realistic scenario if you have removals
from the map.

- BenchmarkBigramCounting: calculating bigram frequencies from a real corpus
of texts in Polish. The input is 10MB long, so it's not a big data set, but
the texts are realistic and hence so are bigram distributions.
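For illustration, the bigram-counting loop boils down to something like the
sketch below (the class and method names are mine, and I use a boxed
java.util.HashMap for brevity -- the benchmarked libraries run the same loop
on primitive int-int maps, without the boxing overhead):

```java
import java.util.HashMap;
import java.util.Map;

class Bigrams {
    // Pack two adjacent chars into a single int key (high char in the
    // upper 16 bits, low char in the lower 16) and count occurrences.
    static Map<Integer, Integer> countBigrams(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            int key = (text.charAt(i) << 16) | text.charAt(i + 1);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Note the packed keys: two chars in one int is exactly the kind of
"high bits carry information" distribution that stresses key hashing.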

I tested the following hash map implementations (int-int primitives):

- fastutil (6.1.0),
- HPPC (trunk, but without improvements resulting from these benchmarks
yet, so I'm not biased ;)),
- Java HashMap (on boxed Integer keys),
- GNU Trove (3.0.0rc1),
- Mahout Math/Collections (1.0).

The bigram counting also includes PCJ and fastutil's linked hash map
implementation, for historical reasons (but these are less relevant).

The benchmarks were run with Google Caliper on a Windows 7 machine with an
Intel i7 CPU; a detailed log is attached. Times below are in ms, and the
links should lead you to on-line charts. Some conclusions:

1) BenchmarkPut,
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkPut

distribution implementation    ms linear runtime
      RANDOM           HPPC  77.9 ==========
      LINEAR           HPPC  79.1 ==========
    HIGHBITS           HPPC  78.4 ==========
      RANDOM       FASTUTIL  97.4 ============
      LINEAR       FASTUTIL 101.2 =============
    HIGHBITS       FASTUTIL 101.3 =============
      RANDOM           JAVA 232.6 ==============================
      LINEAR           JAVA  72.7 =========
    HIGHBITS           JAVA 184.2 =======================
      RANDOM         MAHOUT 125.9 ================
      LINEAR         MAHOUT  60.3 =======
    HIGHBITS         MAHOUT  98.4 ============
      RANDOM          TROVE 113.8 ==============
      LINEAR          TROVE  43.8 =====
    HIGHBITS          TROVE  83.7 ==========

I believe additional good hashing of integer keys helps to even out any
weird input key distributions (and these occur a lot in practical
situations, I bet). fastutil uses the finalization step from MurmurHash3;
HPPC uses a simplified full pass over the int's 4 bytes from MurmurHash2,
but the two yield an equivalent effect (the speed difference stems from
something else). Mahout Math currently doesn't hash ints at all, so this
would probably be a good addition (at very little overhead).
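For reference, here is MurmurHash3's 32-bit finalization step (the mixing
function fastutil applies to keys; the class name here is mine). The quick
check in comments shows why it matters for the HIGHBITS distribution: keys
that differ only in their high bits all collapse onto one slot when the
table index is taken by bitmask alone, whereas mixed keys spread out:

```java
class HashMix {
    // MurmurHash3's finalization step ("fmix32"): a few shifts and
    // multiplies that spread entropy from all input bits into the low
    // bits, which are the only ones a power-of-two bitmask actually uses.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
    // Without mixing, (i << 16) & 1023 == 0 for every i: all keys land
    // in slot 0. After mix(), the same keys scatter across the table.
}
```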

2) BenchmarkGetWithRemoved:
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved

%rem.keys   implementation     ms linear runtime
          0           HPPC  65.73 ============
          0       FASTUTIL  52.63 ==========
          0           JAVA 156.37 ==============================
          0          TROVE  75.31 ==============
          0         MAHOUT  85.59 ================
       0.25           HPPC  87.19 ================
       0.25       FASTUTIL  58.24 ===========
       0.25           JAVA 149.26 ============================
       0.25          TROVE 123.32 =======================
       0.25         MAHOUT 113.77 =====================
        0.5           HPPC 101.70 ===================
        0.5       FASTUTIL  55.74 ==========
        0.5           JAVA 135.01 =========================
        0.5          TROVE  93.19 =================
        0.5         MAHOUT 125.16 ========================
       0.75           HPPC  93.98 ==================
       0.75       FASTUTIL  42.20 ========
       0.75           JAVA 101.29 ===================
       0.75          TROVE  76.07 ==============
       0.75         MAHOUT  94.67 ==================
          1           HPPC  60.52 ===========
          1       FASTUTIL  10.78 ==
          1           JAVA  44.30 ========
          1          TROVE   9.54 =
          1         MAHOUT  30.98 =====

This shows two things. First, open addressing based on double hashing (with
REMOVED state markers) requires either automatic or manual compaction
(rehashing) to avoid performance degradation. HPPC doesn't compact
automatically (we rely on the user to do it), so query time grows with the
number of removed keys:

          0           HPPC  65.73 ============
       0.25           HPPC  87.19 ================
        0.5           HPPC 101.70 ===================
       0.75           HPPC  93.98 ==================
          1           HPPC  60.52 ===========

(the drop at 100% of removed keys is probably HotSpot over-learning and
optimizing differently). It is curious why Mahout's implementation was so
much slower than everything else, but I haven't had the time to investigate
yet. The only thing that comes to mind is its use of the modulus operator %
for slot addressing instead of bitmasking (which restricts map sizes to
powers of two). Also note that even regular Java does a fairly good job
here (the number of keys was 2,000,000, but even for larger inputs the
results were quite similar). And note fastutil here:

          0       FASTUTIL  52.63 ==========
       0.25       FASTUTIL  58.24 ===========
        0.5       FASTUTIL  55.74 ==========
       0.75       FASTUTIL  42.20 ========
          1       FASTUTIL  10.78 ==

Nice. Sebastiano reimplemented the hash maps to use linear probing instead
of double hashing, and with linear probing you can avoid REMOVED state
markers entirely (so there is no need to rehash, and you get empty slots
back for reuse when adding new elements). The key here is the combination
of linear probing AND a good key-mangling hash function (theoretically,
with a perfect hash function, the input key distribution wouldn't affect
linear probing performance at all -- no clustering or chaining effects).
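To make the tombstone-free removal concrete, here is a minimal
linear-probing int-int map sketch with backward-shift deletion (this is my
own toy code, not fastutil's actual implementation; it reserves key 0 as
the empty marker and omits resizing, so the load factor must stay low):

```java
// Toy linear-probing int->int map. Assumes keys != 0 (0 marks an empty
// slot), power-of-two capacity (bitmask addressing), and no resizing.
class LpMap {
    private final int[] keys;   // 0 == empty slot
    private final int[] vals;
    private final int mask;

    LpMap(int capacityPow2) {
        keys = new int[capacityPow2];
        vals = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // MurmurHash3 finalization step, used to scramble the key bits.
    private static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        return h ^ (h >>> 16);
    }

    void put(int key, int value) {
        int pos = mix(key) & mask;
        while (keys[pos] != 0 && keys[pos] != key) pos = (pos + 1) & mask;
        keys[pos] = key;
        vals[pos] = value;
    }

    int get(int key, int defaultValue) {
        int pos = mix(key) & mask;
        while (keys[pos] != 0) {            // an empty slot ends the chain
            if (keys[pos] == key) return vals[pos];
            pos = (pos + 1) & mask;
        }
        return defaultValue;
    }

    // Backward-shift deletion: instead of leaving a REMOVED tombstone,
    // slide later entries of the probe chain back into the freed slot.
    void remove(int key) {
        int pos = mix(key) & mask;
        while (keys[pos] != 0 && keys[pos] != key) pos = (pos + 1) & mask;
        if (keys[pos] == 0) return;          // key not present
        int last = pos;
        for (pos = (pos + 1) & mask; keys[pos] != 0; pos = (pos + 1) & mask) {
            int ideal = mix(keys[pos]) & mask;
            // The entry at 'pos' may move back to 'last' only if its ideal
            // slot does not lie (cyclically) strictly between them.
            boolean canMove = (last <= pos)
                    ? (ideal <= last || ideal > pos)
                    : (ideal <= last && ideal > pos);
            if (canMove) {
                keys[last] = keys[pos];
                vals[last] = vals[pos];
                last = pos;
            }
        }
        keys[last] = 0;                      // the freed slot is truly empty
    }
}
```

After a remove(), the table looks exactly as if the key had never been
inserted, which is why gets don't slow down as keys are removed.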

3) Bigrams,
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

This is a toy example, I know, but it's the only realistic one in the set,
so it's probably worth mentioning. It is a little puzzling to me why HPPC
is (by a small margin) faster than fastutil here, because it really
shouldn't be. This isn't exactly noise (variance is given in the attached
plain-text log), but I still consider it a glitch within normal variance.

You can try these on your own machine -- I would actually appreciate it if
you sent your results and feedback back to me or this list, so that we can
see how well the above holds up between architectures, processors and VMs.

git clone [email protected]:carrotsearch/hppc.git
cd hppc-others
(cd lib/; bash ./install-deps )
mvn clean package
java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee benchmark.log

I didn't check the performance of simple lists or other structures with
minimal overhead because I don't imagine them being significantly faster or
slower in any of these libraries. For those, other factors play a role (I
personally like the fact that I can access the internal buffers in HPPC,
but that may also be a point against using it ;).

Dawid
