I am 100% in favor of replacing the colt-based thing in there now with something better if there is something better. I'll commit the patch, or help test, or whatever.
I volunteered to put lipstick on the colt-based pig because I really
wanted *some* non-GPL alternative to Trove at Apache, and the colt-based
code sitting around Mahout was the shortest path.

On Fri, Mar 4, 2011 at 6:11 AM, Dawid Weiss <[email protected]> wrote:
> A while ago we discussed whether it'd make sense to integrate HPPC with
> Mahout. I remember Ted mentioned it'd be good to have some kind of
> comparative benchmark that would simply compare Mahout Math collections
> against HPPC. I finally found the time to do it, driven by some strong
> updates Sebastiano Vigna introduced in fastutil. But first, a
> disclaimer :)
>
> I am not trying to impose HPPC on anybody and I'm not trying to show
> which library is "better" or "worse" in a general sense. I believe the
> benchmarks presented below, and any conclusions that can be drawn from
> them, apply only in a marginal sense to any algorithm complex enough to
> be doing something ELSE other than stressing a given hash map
> implementation. I also believe that diversity drives progress, so it's
> good to share ;)
>
> So. First, I've been thinking about how to measure hash map performance
> in any sensible (read: reasonable) application. Sebastiano suggested some
> clues and I came up with some other ideas, but any realistic benchmarks
> (in the sense of data or algorithm) would be a welcome extension. I've
> implemented the following benchmarks (all on int-int maps):
>
> - BenchmarkPut: injecting a lot of integers into a map, basically what
>   one would do to get a unique set of keys, for example. The input ints
>   have different distributions: linear, random and high-bits linear
>   (increments on higher bits, then on lower bits).
>
> - BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
>   fraction of them (not benchmarked), and (benchmarked) querying this map
>   for a partially matching set of keys. A realistic scenario if you have
>   removals from the map.
>
> - BenchmarkBigramCounting: calculating bigram frequencies from a real
>   corpus of texts in Polish. The input is 10MB long, so it's not a big
>   data set, but the texts are realistic and hence so are the bigram
>   distributions.
>
> I tested the following hash map implementations (int-int primitives):
>
> - fastutil (6.1.0),
> - HPPC (trunk, but without improvements resulting from these benchmarks
>   yet, so I'm not biased ;)
> - Java HashMap (on boxed Integer keys),
> - GNU Trove (3.0.0rc1),
> - Mahout Math/Collections (1.0).
>
> The bigram counting also includes PCJ and fastutil's linked hash map
> implementation for historic reasons (but these are less relevant).
>
> The benchmarks were done on a Windows 7 machine with an i7 CPU, detailed
> log attached, using Google Caliper. Times below are in ms. Links should
> lead you to on-line charts. Some conclusions:
>
> 1) BenchmarkPut,
> http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkPut
>
> distribution  implementation     ms  linear runtime
>       RANDOM            HPPC   77.9  ==========
>       LINEAR            HPPC   79.1  ==========
>     HIGHBITS            HPPC   78.4  ==========
>       RANDOM        FASTUTIL   97.4  ============
>       LINEAR        FASTUTIL  101.2  =============
>     HIGHBITS        FASTUTIL  101.3  =============
>       RANDOM            JAVA  232.6  ==============================
>       LINEAR            JAVA   72.7  =========
>     HIGHBITS            JAVA  184.2  =======================
>       RANDOM          MAHOUT  125.9  ================
>       LINEAR          MAHOUT   60.3  =======
>     HIGHBITS          MAHOUT   98.4  ============
>       RANDOM           TROVE  113.8  ==============
>       LINEAR           TROVE   43.8  =====
>     HIGHBITS           TROVE   83.7  ==========
>
> I believe additional good hashing of integer keys does help to even out
> any weird input key distributions (and these would occur a lot in
> practical situations, I bet). fastutil uses the finalization step from
> MurmurHash3, HPPC uses a simplified full pass over the int's 4 bytes from
> MurmurHash2, but these yield an equal effect (the speed difference is the
> result of something else).
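For reference, the MurmurHash3 finalization step Dawid mentions is small enough to show inline. This is the standard fmix32 mixer, not a quote of fastutil's actual source; the class and method names here are mine:

```java
// Standard 32-bit MurmurHash3 finalizer ("fmix32"): scrambles an int key
// so that patterned inputs (linear, high-bit increments) spread evenly
// over power-of-two table sizes. Not fastutil's source; an illustration.
public final class HashMix {
    static int fmix32(int k) {
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        k *= 0xc2b2ae35;
        k ^= k >>> 16;
        return k;
    }

    public static void main(String[] args) {
        // Sequential keys land in scattered buckets after mixing:
        int mask = (1 << 4) - 1; // a hypothetical table of 16 slots
        for (int key = 0; key < 4; key++) {
            System.out.println(key + " -> slot " + (fmix32(key) & mask));
        }
    }
}
```

The whole mix is five cheap register operations, which is why adding it to a map that currently hashes ints as themselves costs very little.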
> Mahout Math currently doesn't hash ints, so it'd probably be a good
> addition (at very little overhead).
>
> 2) BenchmarkGetWithRemoved:
> http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved
>
> %rem.keys  implementation      ms  linear runtime
>         0            HPPC   65.73  ============
>         0        FASTUTIL   52.63  ==========
>         0            JAVA  156.37  ==============================
>         0           TROVE   75.31  ==============
>         0          MAHOUT   85.59  ================
>      0.25            HPPC   87.19  ================
>      0.25        FASTUTIL   58.24  ===========
>      0.25            JAVA  149.26  ============================
>      0.25           TROVE  123.32  =======================
>      0.25          MAHOUT  113.77  =====================
>       0.5            HPPC  101.70  ===================
>       0.5        FASTUTIL   55.74  ==========
>       0.5            JAVA  135.01  =========================
>       0.5           TROVE   93.19  =================
>       0.5          MAHOUT  125.16  ========================
>      0.75            HPPC   93.98  ==================
>      0.75        FASTUTIL   42.20  ========
>      0.75            JAVA  101.29  ===================
>      0.75           TROVE   76.07  ==============
>      0.75          MAHOUT   94.67  ==================
>         1            HPPC   60.52  ===========
>         1        FASTUTIL   10.78  ==
>         1            JAVA   44.30  ========
>         1           TROVE    9.54  =
>         1          MAHOUT   30.98  =====
>
> This shows two things. The first is that open addressing based on double
> hashing (with REMOVED state markers) requires either automatic or manual
> compaction (rehashing) to avoid performance degradation. HPPC doesn't do
> this because we rely on the user, and so the time grows with the number
> of removed keys:
>
>     0  HPPC   65.73  ============
>  0.25  HPPC   87.19  ================
>   0.5  HPPC  101.70  ===================
>  0.75  HPPC   93.98  ==================
>     1  HPPC   60.52  ===========
>
> (the drop at 100% of removed keys is probably HotSpot over-learning and
> optimizing differently). It is curious why Mahout's implementation was so
> much slower than everything else, but I haven't had the time to
> investigate yet.
> The only thing that comes to my mind is its use of modulus (%) for
> sizing instead of bit masking (restricted 2^n map sizes). Also note that
> even regular Java does a fairly good job here (the number of keys was
> 2,000,000, but even for larger inputs the results were quite similar).
> Also, note fastutil here:
>
>     0  FASTUTIL  52.63  ==========
>  0.25  FASTUTIL  58.24  ===========
>   0.5  FASTUTIL  55.74  ==========
>  0.75  FASTUTIL  42.20  ========
>     1  FASTUTIL  10.78  ==
>
> Nice. Sebastiano reimplemented hash maps to use linear addressing instead
> of double hashing, and with linear addressing you can avoid REMOVED state
> markers entirely (so there's no need to rehash, and you get empty spots
> back for reuse when adding new elements). The key here is the combination
> of linear addressing AND a good key hash mangling function
> (theoretically, with a perfect hashing function, the input key
> distribution wouldn't affect linear addressing performance at all -- no
> clustering or chaining effects).
>
> 3) Bigrams,
> http://dawidweiss123.appspot.com/run/[email protected]/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting
>
> This is a toy example, I know, but it's the only realistic one in the
> set, so it's probably worth mentioning. It is a little bit puzzling to me
> why HPPC is (by a small margin) faster than fastutil, because it really
> shouldn't be. This isn't exactly noise (variance is given in the attached
> plain-text log), but I still consider it a glitch within the regular
> variance.
>
> You can try these on your own machine -- I would actually appreciate it
> if you sent your results and feedback back to me or this list so that we
> can see how the above holds up across architectures, processors and VMs.
>
> git clone [email protected]:carrotsearch/hppc.git
> cd hppc-others
> (cd lib/; bash ./install-deps)
> mvn clean package
> java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee benchmark.log
>
> I didn't check the performance of simple lists or other structures with
> minimal overhead because I don't imagine them being significantly faster
> or slower in any library. For these, other factors play a role (I
> personally like the fact that I can access the internal buffers in HPPC,
> but that may also be a point against using it ;).
>
> Dawid
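The deletion trick Dawid credits to fastutil's rewrite -- linear probing with slots genuinely freed on removal, so no REMOVED markers ever accumulate -- is worth seeing concretely. Below is a minimal sketch of the general technique (backward-shift deletion), not fastutil's code; all names, the sentinel choice, and the fixed-capacity simplification are mine:

```java
import java.util.Arrays;

// Minimal int-int open-addressing map using linear probing with
// "backward shift" deletion: on remove, later entries of the same probe
// chain are moved back to close the gap, so the slot is truly freed and
// no tombstone compaction / rehash is ever needed.
// Simplifications: fixed power-of-two capacity, no resize, caller must
// keep the load factor below 1, and the sentinel key must never be used.
public final class LinearProbeIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel (assumed unused as a key)
    private final int[] keys;
    private final int[] values;
    private final int mask;

    LinearProbeIntMap(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    private static int mix(int k) { // MurmurHash3 fmix32 scrambler
        k ^= k >>> 16; k *= 0x85ebca6b;
        k ^= k >>> 13; k *= 0xc2b2ae35;
        return k ^ (k >>> 16);
    }

    void put(int key, int value) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) slot = (slot + 1) & mask;
        keys[slot] = key;
        values[slot] = value;
    }

    int get(int key, int defaultValue) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return values[slot];
            slot = (slot + 1) & mask;
        }
        return defaultValue;
    }

    void remove(int key) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) slot = (slot + 1) & mask;
        if (keys[slot] == EMPTY) return; // key not present
        // Backward shift: walk the chain after the gap and pull back any
        // entry whose home slot lies cyclically at or before the gap, so
        // future probes for it still succeed.
        int gap = slot;
        for (int i = (gap + 1) & mask; keys[i] != EMPTY; i = (i + 1) & mask) {
            int home = mix(keys[i]) & mask;
            if (((i - home) & mask) >= ((i - gap) & mask)) {
                keys[gap] = keys[i];
                values[gap] = values[i];
                gap = i;
            }
        }
        keys[gap] = EMPTY; // the freed slot, no REMOVED marker left behind
    }

    public static void main(String[] args) {
        LinearProbeIntMap m = new LinearProbeIntMap(16);
        m.put(42, 7);
        m.remove(42);
        System.out.println(m.get(42, -1)); // prints -1: slot freed, no tombstone
    }
}
```

Contrast this with double hashing, where an entry's probe sequence jumps around the table, so a removed slot can't be closed by shifting neighbors and a REMOVED marker (and eventual rehash) is the usual answer -- exactly the degradation visible in the HPPC rows above.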
