[ 
https://issues.apache.org/jira/browse/LUCENE-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-2967.
---------------------------------

       Resolution: Won't Fix
    Lucene Fields:   (was: [New])

I spent some time on this. It's quite fascinating: the number of collisions for 
the default probing is smaller than:

a) linear probing with murmurhash mix of the original hash
b) linear probing without murmurhash mix (start from raw hash only).

Curiously, the number of collisions for (b) is smaller than for (a) -- this 
could be explained if we assume bits are spread evently throughout the entire 
32-bit range after murmurhash, so after masking to table size there should be 
more collisions on lower bits compared to a raw hash (this would have more 
collisions on upper bits and fewer on lower bits because it is 
multiplicative... or at least I think so).

Anyway, I tried many different versions and I don't see any significant 
difference in favor of linear probing here. Measured the GC overhead during my 
tests too, but it is not the primary factor contributing to the total cost of 
constructing the FST (about 3-5% of the total time, running in parallel, 
typically).

> Use linear probing with an additional good bit avalanching function in FST's 
> NodeHash.
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2967
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2967
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>             Fix For: 4.0
>
>         Attachments: LUCENE-2967.patch
>
>
> I recently had an interesting discussion with Sebastiano Vigna (fastutil), 
> who suggested that linear probing, given a hash mixing function with good 
> avalanche properties, is a way better method of constructing lookups in 
> associative arrays compared to quadratic probing. Indeed, with linear probing 
> you can implement removals from a hash map without removed slot markers and 
> linear probing has nice properties with respect to modern CPUs (caches). I've 
> reimplemented HPPC's hash maps to use linear probing and we observed a nice 
> speedup (the same applies for fastutils of course).
> This patch changes NodeHash's implementation to use linear probing. The code 
> is a bit simpler (I think :). I also moved the load factor to a constant -- 
> 0.5 seems like a generous load factor, especially if we allow large FSTs to 
> be built. I don't see any significant speedup in constructing large automata, 
> but there is no slowdown either (I checked on one machine only for now, but 
> will verify on other machines too).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to