There are a few other things to notice here. This only affects our RASV (RandomAccessSparseVector) and not the others. Higher-order vector ops have tricks to speed this up; overall we are talking about a < 10% speedup margin for algorithms. And this says nothing about iteration speed, which we also care a lot about.
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jun 6, 2013 at 10:20 AM, Grant Ingersoll <[email protected]> wrote:

> Been following closely all the threads here (and learning a lot). Is
> there a point where we just start using HPPC, or do we do something it
> doesn't, so that the performance tests here are simply a pedagogical
> exercise for those interested in hashing strategies b/c there is no way
> we can switch, or no practical reason to switch?
>
> (ducks)
>
> Has anyone hooked up a profiler to Mahout and run the tests below to see
> where our bottlenecks are? Robin seemed to indicate to me the other day
> that we should be as fast.
>
> -Grant
>
> On Jun 6, 2013, at 3:22 PM, Robin Anil <[email protected]> wrote:
>
> > I also did a small experiment myself; not conclusive, but putting it
> > here for the sake of the record.
> >
> > adjustOrPutValue() was calling put() if it didn't find the entry in the
> > table, which re-did the hash lookup. It turns out that removing that
> > extra lookup didn't help the benchmarks significantly (a 1.4 ms drop is
> > not significant).
> >
> > Before:
> >
> >         library     ms  linear runtime
> >            HPPC   85.5  =================
> >           TROVE  112.9  =======================
> >   FASTUTIL_OPEN   65.3  =============
> > FASTUTIL_LINKED   61.5  ============
> >          MAHOUT  143.4  ==============================
> >
> > After:
> >
> >         library     ms  linear runtime
> >            HPPC   85.3  ==================
> >           TROVE  111.1  =======================
> >   FASTUTIL_OPEN   65.0  =============
> > FASTUTIL_LINKED   61.2  ============
> >          MAHOUT  142.0  ==============================
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> > On Thu, Jun 6, 2013 at 9:05 AM, Dawid Weiss <[email protected]> wrote:
> >
> >>> Yes I thought this was why you could largely get away with it. If the
> >>> table is big enough, collisions and chains should not be a big deal
> >>> to begin with.
> >>
> >> I admit I didn't think it could be double hashing behind it, so I
> >> assumed 2^n-1 -- sorry for not checking before. Good to be able to
> >> compare alternate implementations (the upside of prime-based sizes is
> >> that they grow somewhat less aggressively).
> >>
> >>> Since you have this benchmark readily available -- what happens if
> >>> you un-comment one or both of the lines of code that mix the hash?
> >>> That would really answer it.
> >>
> >> I've run the same set of benchmarks substituting MurmurHash3's last
> >> avalanche step in HashOperations.
> >> The results are, in short, that it does get a tad bit slower on
> >> average, but is (predictably) more stable:
> >>
> >> BenchmarkContainsWithRemoved (2000000 calls):
> >>
> >> Before:
> >> removedKeys     ms  linear runtime
> >>           0   95.6  =====================
> >>        0.25  116.2  ==========================
> >>         0.5  131.8  ==============================
> >>        0.75   78.3  =================
> >>           1   12.9  ==
> >>
> >> After:
> >> removedKeys     ms  linear runtime
> >>           0  101.4  =====================
> >>        0.25  123.6  ==========================
> >>         0.5  138.6  ==============================
> >>        0.75   82.0  =================
> >>           1   14.7  ===
> >>
> >> BenchmarkPut (1000000 calls):
> >>
> >> Before:
> >> distribution    ms  linear runtime
> >>       RANDOM  61.8  ==============================
> >>       LINEAR  20.9  ==========
> >>     HIGHBITS  54.8  ==========================
> >>
> >> After:
> >> distribution    ms  linear runtime
> >>       RANDOM  64.9  =============================
> >>       LINEAR  65.2  =============================
> >>     HIGHBITS  65.7  ==============================
> >>
> >> So you're right -- I don't think it makes sense to add that hashing
> >> unless you can devise a worst-case sequence that will break double
> >> hashing (I can't). Note, however, that these times are generally
> >> slower than with the 2^n-1 implementations (hashing plus linear
> >> probing); this seems clear taking into account CPU caches and the fact
> >> that less (and simpler) math needs to be done for every key. The
> >> advantage I see in double hashing is being able to devise a prime
> >> sequence that doesn't grow as fast as 2^n.
> >>
> >> So the conclusion: no need for hashing. Which makes me think: why
> >> bother to do this, then:
> >>
> >> public static int hash(float value) {
> >>   return Float.floatToIntBits(value * 663608941.737f);
> >> }
> >>
> >> or this, for that matter:
> >>
> >> public static int hash(long value) {
> >>   return (int) (value ^ (value >> 32));
> >> }
> >>
> >> Note all bits will cancel out to zero if your input keys are
> >> (value << 32 | value). If you remember, I mentioned guarding against
> >> the worst case -- a random hash seems to be less probable than this
> >> (run it):
> >>
> >> OpenLongIntHashMap map = new OpenLongIntHashMap();
> >> for (int i = 0; i < 1000000; i++) {
> >>   long k = (long) i << 32 | i;
> >>   map.put(k, i);
> >> }
> >>
> >> Dawid
> >>
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
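
For context on the adjustOrPutValue() change Robin benchmarks above: the idea is to run the probe sequence once and use the slot it ends at for both outcomes, rather than a failed lookup followed by a put() that probes again. A minimal sketch, using a toy linear-probing table with hypothetical names -- Mahout's actual open hash maps use double hashing and a different slot layout:

    public class AdjustOrPutSketch {
      // Hypothetical sentinel marking an empty slot; real implementations
      // track slot state separately so the full key range stays usable.
      private static final int FREE = Integer.MIN_VALUE;
      private final int[] keys = new int[1 << 10];   // fixed power-of-two capacity, no resize
      private final int[] values = new int[1 << 10];
      { java.util.Arrays.fill(keys, FREE); }

      public int adjustOrPutValue(int key, int putValue, int adjustAmount) {
        int mask = keys.length - 1;
        int slot = key & mask;                       // power-of-two table: mask, not mod
        while (keys[slot] != FREE && keys[slot] != key) {
          slot = (slot + 1) & mask;                  // one probe sequence serves both cases
        }
        if (keys[slot] == key) {
          values[slot] += adjustAmount;              // hit: adjust in place
        } else {
          keys[slot] = key;                          // miss: insert into the slot already found
          values[slot] = putValue;
        }
        return values[slot];
      }

      public static void main(String[] args) {
        AdjustOrPutSketch m = new AdjustOrPutSketch();
        m.adjustOrPutValue(42, 1, 1);                     // absent: puts 1
        System.out.println(m.adjustOrPutValue(42, 1, 1)); // present: adjusts to 2
      }
    }

One plausible reading of Robin's flat numbers is that the old get()-then-put() pair walks the same probe chain twice and the second walk is cache-warm, so collapsing it saves little.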

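For reference on Dawid's substitution: the "last avalanche step" of MurmurHash3 is the 32-bit finalizer from Austin Appleby's public-domain reference implementation, shown below; whether HashOperations used exactly this form is an assumption based on his description.

    public class Fmix32Demo {
      // MurmurHash3's 32-bit finalization ("avalanche") step; assumed to be
      // the mix Dawid substituted into HashOperations.
      static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
      }

      public static void main(String[] args) {
        // Consecutive keys (the LINEAR distribution above) scatter like
        // random ones once mixed, which matches LINEAR puts slowing from
        // 20.9 ms to 65.2 ms -- right in line with RANDOM.
        for (int i = 0; i < 4; i++) {
          System.out.printf("%d -> %08x%n", i, fmix32(i));
        }
      }
    }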