Two partial answers in-line below, but as it says there, Mark Engelberg has
more experimental data than I do right now on proposed changes.


On Fri, Oct 25, 2013 at 9:11 AM, Stuart Halloway
<stuart.hallo...@gmail.com>wrote:

> To save others from having to read this whole thread, is the following
> summary correct?
>
> 1. We believe that the performance issue Paul first encountered is
> substantially / entirely caused by hash collisions.
>

There are definitely many hash collisions with Paul's program.  I believe
Mark Engelberg has already run experiments to show that most/all of these
collisions go away with one proposed change to the hash functions.


> 2. The hash collisions could be improved by spreading the hashes of
> numbers (and not needing to mess with hashes of collections).
>

If the only change made were to spread out the hashes of numbers using a
linear function, e.g. something of the form const1 * number + const2, then
the same hash collisions would remain, due to the current hash functions
for vectors/sequences and sets and the nature of Paul's data (which were
definitely not created with an eye towards breaking the hash functions --
they are quite reasonable things a programmer would hope to work
efficiently without needing to know any special tricks).

If the only change made were to spread out the hashes of numbers using a
non-linear function, e.g. something like the number*number suggested by
Mark Engelberg, then it needs more verification than I personally have done
so far, but I believe that would help avoid many of the collisions.  I
think Mark Engelberg could make comments based on more collected data than
I have.

I do not know the answers to questions a or b below.

Andy


>
> If the above summary is correct, then I have the following questions:
>
> a. Is the approach Scala took substantially #2 above, or something
> different / more?
>
> b. Is changing the hash codes of numbers going to break expectations on
> the Java side of things?
>
>
> On Thu, Oct 24, 2013 at 8:37 PM, Evan Gamble <solar.f...@gmail.com> wrote:
>
>> As Andy Fingerhut pointed out, since (hash [a b]) is 31*(a + 30) + (b +
>> 31) when a and b are ints, summing the hashes of 2-tuples when the values
>> are small ints, as happens when hashing sets of such 2-tuples, is quite
>> likely to lead to collisions. This particular problem could be avoided by
>> spreading the hashes of ints out to large numbers with a non-linear
>> function, not just multiplying them all by some N.
>>
>>
>> On Wednesday, October 23, 2013 5:41:15 PM UTC-7, puzzler wrote:
>>
>>> Perhaps I don't understand your point, but vector hashing is not
>>> currently the sum of hashes.  It's a more complex computation.
>>>
>>> => (hash [0 2])
>>> 963
>>> => (hash [2 0])
>>> 1023
>>> => (hash [2])
>>> 33
>>>
>>> The fact that numbers hash to themselves means that the resulting hashes
>>> are not hugely spread apart, but collisions don't seem to be as common as
>>> with sets and maps.  I'm sure it could be improved, but vector/sequence
>>> hashing doesn't strike me as a huge issue.
>>>
>>>
>>> On Wed, Oct 23, 2013 at 5:31 PM, Cedric Greevey <cgre...@gmail.com>wrote:
>>>
>>>> There's still a problem with collisions among *vectors* though.
>>>>
>>>> I'd say the real underlying problem is that small integers hash to
>>>> themselves. If you let N be a fairly large number satisfying GCD(N, 2^32 -
>>>> 1) = 1, then multiplying 32-bit integers by N would send nearby integers to
>>>> widely separated ones, while inducing a permutation on nonzero integers --
>>>> the hash space wouldn't be any smaller. That alone won't help much with the
>>>> vector problem, since 3N + 2N = 5N just as 3 + 2 = 5, though it may help
>>>> avoid collisions with small integers in small hash maps when the hash is
>>>> being used modulo a small hash-table length. What might help with vectors,
>>>> sets, and other small data structures containing integers (but would shrink
>>>> the hash space by about half) is to additionally square the result (all
>>>> operations being 2s-complement 32-bit integer operations with wrapping --
>>>> Java "int" arithmetic). Now 1 and 4 won't give the same summed hashes as 3
>>>> and 2, and 0 and 5 will be different again. You might also want to add some
>>>> constant to avoid 0 hashing to 0.
>>>>
>>>>  --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to clojure+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to