Two partial answers in-line below, but as it says there, Mark Engelberg has more experimental data than I do right now on proposed changes.
On Fri, Oct 25, 2013 at 9:11 AM, Stuart Halloway <stuart.hallo...@gmail.com>wrote: > To save others from having to read this whole thread, is the following > summary correct? > > 1. We believe that the performance issue Paul first encountered is > substantially / entirely caused by hash collisions. > There are definitely many hash collisions with Paul's program. I believe Mark Engelberg has already run experiments to show that most/all of these collisions go away with one proposed change to the hash functions. > 2. The hash collisions could be improved by spreading the hashes of > numbers (and not needing to mess with hashes of collections). > If the only change made were to spread out the hashes of numbers using a linear function, e.g. something of the form const1 * number + const2, then the same hash collisions would remain, due to the current hash functions for vectors/sequences and sets and the nature of Paul's data (which were definitely not created with an eye towards breaking the hash functions -- they are quite reasonable things a programmer would hope to work efficiently without needing to know any special tricks). If the only change made were to spread out the hashes of numbers using a non-linear function, e.g. something like the number*number suggested by Mark Engelberg, then it needs more verification than I personally have done so far, but I believe that would help avoid many of the collisions. I think Mark Engelberg could make comments based on more collected data than I have. I do not know the answers to questions a or b below. Andy > > If the above summary is correct, then I have the following questions: > > a. Is the approach Scala took substantially #2 above, or something > different / more? > > b. Is changing the hash codes of numbers going to break expectations on > the Java side of things? > > > On Thu, Oct 24, 2013 at 8:37 PM, Evan Gamble <solar.f...@gmail.com> wrote: > >> As Andy Fingerhut pointed out, since (hash [a b]) is 31*(a + 30) + (b + >> 31) when a and b are ints, summing the hashes of 2-tuples when the values >> are small ints, as happens when hashing sets of such 2-tuples, is quite >> likely to lead to collisions. This particular problem could be avoided by >> spreading the hashes of ints out to large numbers with a non-linear >> function, not just multiplying them all by some N. >> >> >> On Wednesday, October 23, 2013 5:41:15 PM UTC-7, puzzler wrote: >> >>> Perhaps I don't understand your point, but vector hashing is not >>> currently the sum of hashes. It's a more complex computation. >>> >>> => (hash [0 2]) >>> 963 >>> => (hash [2 0]) >>> 1023 >>> => (hash [2]) >>> 33 >>> >>> The fact that numbers hash to themselves means that the resulting hashes >>> are not hugely spread apart, but collisions don't seem to be as common as >>> with sets and maps. I'm sure it could be improved, but vector/sequence >>> hashing doesn't strike me as a huge issue. >>> >>> >>> On Wed, Oct 23, 2013 at 5:31 PM, Cedric Greevey <cgre...@gmail.com>wrote: >>> >>>> There's still a problem with collisions among *vectors* though. >>>> >>>> I'd say the real underlying problem is that small integers hash to >>>> themselves. If you let N be a fairly large number satisfying GCD(N, 2^32 - >>>> 1) = 1, then multiplying 32-bit integers by N would send nearby integers to >>>> widely separated ones, while inducing a permutation on nonzero integers -- >>>> the hash space wouldn't be any smaller. That alone won't help much with the >>>> vector problem, since 3N + 2N = 5N just as 3 + 2 = 5, though it may help >>>> avoid collisions with small integers in small hash maps when the hash is >>>> being used modulo a small hash-table length. What might help with vectors, >>>> sets, and other small data structures containing integers (but would shrink >>>> the hash space by about half) is to additionally square the result (all >>>> operations being 2s-complement 32-bit integer operations with wrapping -- >>>> Java "int" arithmetic). Now 1 and 4 won't give the same summed hashes as 3 >>>> and 2, and 0 and 5 will be different again. You might also want to add some >>>> constant to avoid 0 hashing to 0. >>>> >>>> -- >> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clojure@googlegroups.com >> Note that posts from new members are moderated - please be patient with >> your first post. >> To unsubscribe from this group, send email to >> clojure+unsubscr...@googlegroups.com >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> --- >> You received this message because you are subscribed to the Google Groups >> "Clojure" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to clojure+unsubscr...@googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. >> > > -- > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > --- > You received this message because you are subscribed to the Google Groups > "Clojure" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to clojure+unsubscr...@googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > -- -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.