A while ago we discussed whether it'd make sense to integrate HPPC with Mahout; I remember Ted mentioned it'd be good to have some kind of comparative benchmark that would simply compare Mahout Math collections against HPPC. I finally found the time to do it, prompted by some significant updates Sebastiano Vigna made to fastutil. But first, a disclaimer :)
I am not trying to impose HPPC on anybody, and I'm not trying to show which library is "better" or "worse" in a general sense. I believe the benchmarks presented below, and any conclusions drawn from them, apply only in a marginal sense to any algorithm complex enough to be doing something ELSE besides stressing a given hash map implementation. I also believe that diversity drives progress, so it's good to share ;)

So. First, I've been thinking about how to measure hash map performance in a sensible (read: realistic) way. Sebastiano suggested some clues and I came up with some other ideas, but any realistic benchmarks (in the sense of data or algorithm) would be a welcome extension. I've implemented the following benchmarks (all on int-int maps):

- BenchmarkPut: injecting a lot of integers into a map, basically what one would do to get a unique set of keys, for example. The input ints have different distributions: linear, random and high-bits linear (increments on higher bits first, then on lower bits).

- BenchmarkGetWithRemoved: injecting a lot of integers, then removing a fraction of them (not benchmarked), and then (benchmarked) querying the map for a partially matching set of keys. A realistic scenario if your workload removes keys from the map.

- BenchmarkBigramCounting: calculating bigram frequencies from a real corpus of Polish texts. The input is only 10MB, so it's not a big data set, but the texts are realistic and hence so are the bigram distributions.

I tested the following hash map implementations (int-int primitives):

- fastutil (6.1.0),
- HPPC (trunk, but without improvements resulting from these benchmarks yet, so I'm not biased ;),
- Java HashMap (on boxed Integer keys),
- GNU Trove (3.0.0rc1),
- Mahout Math/Collections (1.0).

The bigram counting also includes PCJ and fastutil's linked hash map implementation for historic reasons (but these are less relevant). The benchmarks were run with Google Caliper on a Windows 7 machine with an i7 CPU; a detailed log is attached. Times below are in ms. The links should lead you to on-line charts.

Some conclusions:

1) BenchmarkPut
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkPut

distribution implementation    ms linear runtime
      RANDOM           HPPC  77.9 ==========
      LINEAR           HPPC  79.1 ==========
    HIGHBITS           HPPC  78.4 ==========
      RANDOM       FASTUTIL  97.4 ============
      LINEAR       FASTUTIL 101.2 =============
    HIGHBITS       FASTUTIL 101.3 =============
      RANDOM           JAVA 232.6 ==============================
      LINEAR           JAVA  72.7 =========
    HIGHBITS           JAVA 184.2 =======================
      RANDOM         MAHOUT 125.9 ================
      LINEAR         MAHOUT  60.3 =======
    HIGHBITS         MAHOUT  98.4 ============
      RANDOM          TROVE 113.8 ==============
      LINEAR          TROVE  43.8 =====
    HIGHBITS          TROVE  83.7 ==========

I believe additional good hashing of integer keys helps to even out any weird input key distributions (and these occur a lot in practical situations, I bet). fastutil uses the finalization step from MurmurHash3, HPPC uses a simplified full pass over the int's 4 bytes from MurmurHash2, but both yield an equivalent effect (the speed difference comes from something else). Mahout Math currently doesn't hash ints at all, so this would probably be a good addition (at very little overhead).
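For reference, the MurmurHash3 finalization step mentioned above boils down to a handful of shifts and multiplications. Here is a minimal sketch (the class and method names are mine, not any particular library's API; the constants are the standard MurmurHash3 "fmix32" ones):

    // Sketch of the MurmurHash3 finalization step ("fmix32") applied to int keys.
    // Applying something like this before reducing a key to a bucket index
    // evens out clustered input distributions at very little cost.
    public final class IntHash {
        static int mix(int k) {
            k ^= k >>> 16;
            k *= 0x85ebca6b;
            k ^= k >>> 13;
            k *= 0xc2b2ae35;
            k ^= k >>> 16;
            return k;
        }

        public static void main(String[] args) {
            // Even a strictly linear key sequence comes out well spread.
            for (int key = 0; key < 4; key++) {
                System.out.printf("%d -> %08x%n", key, mix(key));
            }
        }
    }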
2) BenchmarkGetWithRemoved
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved

%rem.keys implementation     ms linear runtime
        0          HPPC  65.73 ============
        0      FASTUTIL  52.63 ==========
        0          JAVA 156.37 ==============================
        0         TROVE  75.31 ==============
        0        MAHOUT  85.59 ================
     0.25          HPPC  87.19 ================
     0.25      FASTUTIL  58.24 ===========
     0.25          JAVA 149.26 ============================
     0.25         TROVE 123.32 =======================
     0.25        MAHOUT 113.77 =====================
      0.5          HPPC 101.70 ===================
      0.5      FASTUTIL  55.74 ==========
      0.5          JAVA 135.01 =========================
      0.5         TROVE  93.19 =================
      0.5        MAHOUT 125.16 ========================
     0.75          HPPC  93.98 ==================
     0.75      FASTUTIL  42.20 ========
     0.75          JAVA 101.29 ===================
     0.75         TROVE  76.07 ==============
     0.75        MAHOUT  94.67 ==================
        1          HPPC  60.52 ===========
        1      FASTUTIL  10.78 ==
        1          JAVA  44.30 ========
        1         TROVE   9.54 =
        1        MAHOUT  30.98 =====

This shows two things. The first is that open addressing based on double hashing (with REMOVED state markers) requires either automatic or manual compaction (rehashing) to avoid performance degradation. HPPC doesn't do this (we rely on the user to trigger it), so the time grows with the number of removed keys:

        0          HPPC  65.73 ============
     0.25          HPPC  87.19 ================
      0.5          HPPC 101.70 ===================
     0.75          HPPC  93.98 ==================
        1          HPPC  60.52 ===========

(the drop at 100% of removed keys is probably HotSpot over-specializing and optimizing differently). It is curious why Mahout's implementation was so much slower than everything else, but I haven't had the time to investigate yet. The only thing that comes to mind is its use of the modulus operator (%) for slot addressing instead of bit masking (which restricts map sizes to powers of two). Also note that even the regular Java HashMap does a fairly good job here (the number of keys was 2,000,000, but even for larger inputs the results were quite similar).

The second thing to note is fastutil:

        0      FASTUTIL  52.63 ==========
     0.25      FASTUTIL  58.24 ===========
      0.5      FASTUTIL  55.74 ==========
     0.75      FASTUTIL  42.20 ========
        1      FASTUTIL  10.78 ==

Nice. Sebastiano reimplemented fastutil's hash maps to use linear probing instead of double hashing, and with linear probing you can avoid REMOVED state markers entirely (so there is no need to rehash, and empty slots become reusable when adding new elements). The key here is the combination of linear probing AND a good key-mangling hash function (theoretically, with a perfect hash function, the input key distribution wouldn't affect linear probing performance at all -- no clustering or chaining effects). A minimal sketch of this kind of removal strategy is included near the end of this message.

3) BenchmarkBigramCounting
http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

This is a toy example, I know, but it's the only realistic one in the set, so it's probably worth mentioning. It is a little bit puzzling to me why HPPC is (by a small margin) faster than fastutil here, because it really shouldn't be. This isn't exactly noise (the variance is given in the attached plain-text log), but I still consider it a glitch within the regular run-to-run variation.
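For the curious, the bigram benchmark's inner loop is roughly the following. This is a hypothetical sketch, not the actual BenchmarkBigramCounting code: I assume each bigram is packed as two 16-bit chars into a single int key, and I use a boxed java.util.HashMap only to keep the example self-contained (the benchmark itself runs on the primitive int-int maps listed above).

    import java.util.HashMap;
    import java.util.Map;

    public class BigramCount {
        // Count bigram frequencies: each pair of adjacent chars becomes one int key.
        static Map<Integer, Integer> countBigrams(String text) {
            Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
            for (int i = 1; i < text.length(); i++) {
                // Pack two 16-bit chars into a single 32-bit key.
                int key = (text.charAt(i - 1) << 16) | text.charAt(i);
                Integer c = counts.get(key);
                counts.put(key, c == null ? 1 : c + 1);
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(countBigrams("ala ma kota").size());
        }
    }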
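And here is the minimal illustration of the linear probing removal strategy mentioned under 2). This is an educational sketch, not fastutil's code: a keys-only int set, the class name and details are mine, and resizing/load-factor handling is omitted. The point is that a deleted entry's slot is closed by shifting subsequent cluster entries back, so no REMOVED markers are needed and no periodic rehash is required.

    import java.util.Arrays;

    // Keys-only open-addressing int set with linear probing and backward-shift
    // deletion. Sketch only: fixed power-of-two capacity, no resizing, and it
    // assumes the table never fills up and EMPTY is never used as a key.
    final class LinearProbingIntSet {
        private static final int EMPTY = Integer.MIN_VALUE; // sentinel for free slots
        private final int[] keys;
        private final int mask;

        LinearProbingIntSet(int powerOfTwoCapacity) {
            keys = new int[powerOfTwoCapacity];
            Arrays.fill(keys, EMPTY);
            mask = powerOfTwoCapacity - 1;
        }

        // MurmurHash3-style bit mixing so clustered keys don't form long probe runs.
        private static int mix(int k) {
            k ^= k >>> 16; k *= 0x85ebca6b;
            k ^= k >>> 13; k *= 0xc2b2ae35;
            k ^= k >>> 16;
            return k;
        }

        boolean add(int key) {
            int slot = mix(key) & mask;
            while (keys[slot] != EMPTY) {
                if (keys[slot] == key) return false; // already present
                slot = (slot + 1) & mask;            // linear probe
            }
            keys[slot] = key;
            return true;
        }

        boolean contains(int key) {
            int slot = mix(key) & mask;
            while (keys[slot] != EMPTY) {
                if (keys[slot] == key) return true;
                slot = (slot + 1) & mask;
            }
            return false; // hit an empty slot: the key cannot be further along
        }

        boolean remove(int key) {
            int gap = mix(key) & mask;
            while (true) {
                if (keys[gap] == EMPTY) return false; // not present
                if (keys[gap] == key) break;
                gap = (gap + 1) & mask;
            }
            // Backward-shift deletion: walk the cluster after the gap and pull back
            // any entry whose probe sequence passes through the gap, instead of
            // leaving a REMOVED marker behind.
            int next = (gap + 1) & mask;
            while (keys[next] != EMPTY) {
                int ideal = mix(keys[next]) & mask;
                if (((next - ideal) & mask) >= ((next - gap) & mask)) {
                    keys[gap] = keys[next]; // this entry may legally live in the gap
                    gap = next;             // its old slot becomes the new gap
                }
                next = (next + 1) & mask;
            }
            keys[gap] = EMPTY; // close the final gap
            return true;
        }
    }

(The same mix() as in the earlier sketch is reused here on purpose: linear probing is exactly where a poor hash shows up as long probe chains.)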
You can try these on your own machine -- I would actually appreciate it if you sent your results and feedback back to me or to this list, so that we can see how consistent the above is across architectures, processors and VMs. To reproduce:

git clone [email protected]:carrotsearch/hppc.git
cd hppc-others
(cd lib/; bash ./install-deps)
mvn clean package
java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee benchmark.log

I didn't check the performance of simple lists or other structures with minimal overhead, because I can't imagine them being significantly faster or slower in any of these libraries. For those, other factors play a role (I personally like the fact that I can access the internal buffers in HPPC, but that may also be a point against using it ;).

Dawid
