I've been following all the threads here closely (and learning a lot). Is
there a point where we just start using HPPC? Or do we do something it
doesn't, so that the performance tests here are simply a pedagogical
exercise for those interested in hashing strategies, because there is no
way (or no practical reason) for us to switch?

(ducks)

Has anyone hooked up a profiler to Mahout and run the tests below to see
where our bottlenecks are?  Robin seemed to indicate to me the other day
that we should be just as fast.

-Grant

On Jun 6, 2013, at 3:22 PM, Robin Anil <[email protected]> wrote:

> I also did a small experiment myself -- not conclusive, but putting it
> here for the record.
> 
> adjustOrPutValue() was calling put() when it didn't find the entry in the
> table, which re-did the hash lookup. It turns out that eliminating that
> extra lookup didn't help the benchmarks significantly (a 1.4 ms drop is
> not significant).
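> 
> A rough sketch of the change (the helper and field names below are
> illustrative, not Mahout's actual internals): reuse the slot found by the
> first probe instead of calling put(), which would probe all over again.
> 
>   // indexOfKey() is a hypothetical helper: it returns the slot if the
>   // key is present, otherwise (-(insertion slot) - 1), in the style of
>   // Arrays.binarySearch().
>   public void adjustOrPutValue(int key, int value, int increment) {
>     int i = indexOfKey(key);           // the single hash lookup
>     if (i >= 0) {
>       values[i] += increment;          // present: adjust in place
>     } else {
>       insertAt(-i - 1, key, value);    // absent: reuse the probed slot
>     }
>   }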
> 
> 
> Before
>         library    ms linear runtime
>            HPPC  85.5 =================
>           TROVE 112.9 =======================
>   FASTUTIL_OPEN  65.3 =============
> FASTUTIL_LINKED  61.5 ============
>          MAHOUT 143.4 ==============================
> 
> 
> After
> 
>         library    ms linear runtime
>            HPPC  85.3 ==================
>           TROVE 111.1 =======================
>   FASTUTIL_OPEN  65.0 =============
> FASTUTIL_LINKED  61.2 ============
>          MAHOUT 142.0 ==============================
> 
> 
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> 
> 
> On Thu, Jun 6, 2013 at 9:05 AM, Dawid Weiss <[email protected]> wrote:
> 
>>> Yes I thought this was why you could largely get away with it. If the
>>> table is big enough, collisions and chains should not be a big deal to
>>> begin with.
>> 
>> I admit I didn't think it could be double hashing behind the scenes, so I
>> assumed 2^n-1 masks -- sorry for not checking before. It's good to be able
>> to compare alternate implementations (the upside of prime-based sizes is
>> that they grow somewhat less aggressively).
>> 
>>> Since you have this benchmark readily available -- what happens if you
>>> un-comment one or both of the lines of code that mix the hash? That
>>> would really answer it.
>> 
>> I've run the same set of benchmarks substituting MurmurHash3's final
>> avalanche step into HashOperations. In short, it gets a tad slower on
>> average, but is (predictably) more stable.
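>> 
>> For reference, a sketch of the substituted step -- MurmurHash3's 32-bit
>> finalizer (exactly how it is wired into HashOperations is assumed here):
>> 
>>   static int fmix32(int h) {
>>     h ^= h >>> 16;
>>     h *= 0x85ebca6b;
>>     h ^= h >>> 13;
>>     h *= 0xc2b2ae35;
>>     h ^= h >>> 16;   // final shuffle spreads entropy into the low bits
>>     return h;
>>   }
>> 
>> The numbers: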
>> 
>> BenchmarkContainsWithRemoved (2000000 calls):
>> Before:
>> removedKeys    ms linear runtime
>>          0  95.6 =====================
>>       0.25 116.2 ==========================
>>        0.5 131.8 ==============================
>>       0.75  78.3 =================
>>          1  12.9 ==
>> After:
>> removedKeys    ms linear runtime
>>          0 101.4 =====================
>>       0.25 123.6 ==========================
>>        0.5 138.6 ==============================
>>       0.75  82.0 =================
>>          1  14.7 ===
>> 
>> BenchmarkPut (1000000 calls):
>> Before:
>> distribution   ms linear runtime
>>      RANDOM 61.8 ==============================
>>      LINEAR 20.9 ==========
>>    HIGHBITS 54.8 ==========================
>> After:
>> distribution   ms linear runtime
>>      RANDOM 64.9 =============================
>>      LINEAR 65.2 =============================
>>    HIGHBITS 65.7 ==============================
>> 
>> So you're right -- I don't think it makes sense to add that hashing unless
>> you can devise a worst-case sequence that breaks double hashing (I can't).
>> Note, however, that these times are generally slower than those of 2^n-1
>> implementations with hash mixing and linear probing; this seems clear once
>> you take into account CPU caches and the fact that less (and simpler) math
>> needs to be done for every key. The advantage I see in double hashing is
>> being able to devise a prime sequence that doesn't grow as fast as 2^n.
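>> 
>> (An illustrative sketch, not any library's actual code, of the per-key
>> math behind that comparison: a prime-sized double-hashing table pays two
>> modulo operations per key, a power-of-two table pays a single mask.)
>> 
>>   // prime capacity, double hashing: two divisions per key
>>   static int firstSlot(int hash, int primeCapacity) {
>>     return (hash & 0x7FFFFFFF) % primeCapacity;
>>   }
>>   static int probeStep(int hash, int primeCapacity) {
>>     return 1 + (hash & 0x7FFFFFFF) % (primeCapacity - 2);  // never zero
>>   }
>> 
>>   // power-of-two capacity, linear probing: one AND, and the step is +1
>>   static int maskedSlot(int hash, int capacity) {
>>     return hash & (capacity - 1);  // capacity == 2^n
>>   }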
>> 
>> So the conclusion: no need for hashing. Which makes me think: why bother
>> to do this then:
>> 
>>  public static int hash(float value) {
>>    return Float.floatToIntBits(value * 663608941.737f);
>>  }
>> 
>> or this, for that matter:
>> 
>>  public static int hash(long value) {
>>    return (int) (value ^ (value >> 32));
>>  }
>> 
>> Note that all bits will cancel out to zero if your input keys are of the
>> form (value << 32 | value). If you remember, I mentioned guarding against
>> the worst case -- a worst case under a random hash seems less probable
>> than this one (run it):
>> 
>>        OpenLongIntHashMap map = new OpenLongIntHashMap();
>>        for (int i = 0; i < 1000000; i++) {
>>            // upper and lower 32 bits are identical, so
>>            // (int) (k ^ (k >> 32)) == 0 for every key
>>            long k = (long) i << 32 | i;
>>            map.put(k, i);  // all million keys collide on hash 0
>>        }
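>> 
>> (For contrast -- a sketch, not anything Mahout uses: folding only after a
>> mixing step, e.g. MurmurHash3's 64-bit finalizer, would not collapse on
>> such symmetric keys.)
>> 
>>   static int hash(long value) {
>>     value ^= value >>> 33;
>>     value *= 0xff51afd7ed558ccdL;
>>     value ^= value >>> 33;
>>     value *= 0xc4ceb9fe1a85ec53L;
>>     value ^= value >>> 33;   // mix first, so the fold can't cancel
>>     return (int) value;
>>   }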
>> 
>> Dawid
>> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com