Something got me thinking about the hashes of sequences, and I realized there's an efficiency tradeoff lurking there. The hash ought to depend on at least some of the head elements of the seq, or hash lookup efficiency goes to hell when seqs are used as map keys or in sets. The more elements of the seq are used, the better the hash gets, though with diminishing returns. On the other hand, the more elements are used, the more elements of a *lazy* seq must be *realized* whenever it is used as a key or placed in a set.
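(To make the tradeoff concrete, here is the kind of thing I have in mind. bounded-hash and its cutoff parameter are made up purely for illustration -- nothing like this exists in Clojure. It folds elements together with the usual 31x + y scheme, using unchecked-int to truncate to 32 bits the way Java's int arithmetic would:

    ;; Illustration only: hash at most the first k elements of coll, so
    ;; using a lazy seq as a map key would realize at most k elements.
    ;; unchecked-int truncates to 32 bits, mimicking Java int overflow.
    (defn bounded-hash [coll k]
      (reduce (fn [h x]
                (unchecked-int (+ (* 31 h)
                                  (if (nil? x) 0 (.hashCode x)))))
              1
              (take k coll)))

Two seqs that differ only somewhere past the first k elements then hash alike after realizing just k elements each, and the rare genuine collision gets sorted out by the equality test a map lookup performs anyway.)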
This made me wonder where the tradeoff actually sits, so I wrote this pair of functions to play warmer-colder with:

    (defn q1 [n x] (concat (range n) [x]))

    (defn q2 [n] (= (.hashCode (q1 n "foo"))
                    (.hashCode (q1 n "bar"))))

I figured the minimum n needed to get a true return from q2 would be a small integer, but the result was rather surprising:

    user=> (q2 50000)
    false

It seems to be using potentially all of the elements. That looks rather inefficient: a few elements would be enough to get rid of most hash collisions for commonplace data, and the equality test (which needs all the elements up to the first difference) only has to run at lookup time when hashes actually collide, deferring the realization of potentially very many lazy sequence elements.

What's the rationale here? Are hash collisions caused by differing seqs with long identical prefixes expected to be commonplace? With strings, long identical prefixes are frequent (e.g. file paths all in some deeply nested directory, or URLs with similar structure), but this tends to occur only in situations where "long" means "several tens of characters, occasionally more than 100", not fifty thousand or more...

I also noticed that though using all of the elements (if that is indeed what it does) permits making the hash consistent with String's (i.e. (seq "foo") and "foo" would have the same hash) and with the hashes of vectors, we actually have:

    user=> (map #(.hashCode %) ["foo" (seq "foo") [\f \o \o]])
    (101574 131365 131365)

Vectors and their seqs do indeed appear to have the same hash. (Right off this one data point we already have 10^5-to-one odds against their not doing so: if they didn't, the two identical numbers in the output above would, given their order of magnitude, be a 10^5-to-one-against coincidence *or worse* [more below].) However, strings seem not to share a hash code with their own seqs. Vectors and seqs with the same content in the same order also compare equal, but a string and its seq don't. (Fortunately, given that they have different hashes; objects that compare equal but hash differently would break the hashCode contract.)

[The exact argument for the probabilities uses Bayes's theorem: P(A|B) = P(B|A)P(A)/P(B). If A is "the algorithms are different" and B is "the hashes are identical", P(B|A) is no more than about 0.00001 if the hashes are fairly uniform on a range spanning at least from 1 to the observed numbers. A good hash function *should* be fairly uniform, and the actual range of Java hashes spans from -2,147,483,648 to 2,147,483,647, so 0.00001 is quite conservative. Suppose we accept a prior probability of 0.5 for A. Then P(B) would have to be 0.5 + 0.5*P(B|A) ~ 0.5, since P(B|A) is very small, and P(A|B) ~ P(B|A) is also very small. So if we started out with a coin-toss guess as to whether the null hypothesis A was correct, we'd now be rejecting it with pretty strong confidence. Even starting with a "likely" prior of 0.9 for A gives P(B) = 0.1 + 0.9*P(B|A) ~ 0.1 and P(A)/P(B) ~ 9, and we still end up rejecting A with a lot of confidence, as 0.00009 is still a very small number.

Now, strictly speaking, we'd have to have chosen the string randomly and uniformly on some large space of strings to call this anything close to scientific, but I'd say it's still a pretty good bet that ... oh, why not just check it: <digs up source code> It seems that APersistentVector and ASeq implement hashCode separately, but do indeed use the same algorithm. It also looks like ASeq's version indeed doesn't stop until (and unless!) it runs out of sequence to consume.
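For reference, the shared algorithm is the java.util.List hashCode contract: seed the hash with 1 and fold in each element's hashCode with a multiplier of 31. The reduce below is just my own REPL illustration of that formula (it uses ordinary long arithmetic, so it only matches Java's wrapping 32-bit ints for short inputs like this one):

    user=> (reduce (fn [h x] (+ (* 31 h) (.hashCode x))) 1 [\f \o \o])
    131365

That also pins down why strings differ from their seqs: String.hashCode is the same fold seeded with 0 rather than 1, so "foo" comes out exactly 31^3 = 29791 lower, and indeed 131365 - 29791 = 101574.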
The duplication of the algorithm is presumably for efficiency: APersistentVector could instead compute its hash as _hash = seq().hashCode();, but that would wrap the traversal in an extra layer of method calls and create a temporary object for every element traversed. Still, *why* not just externalize the algorithm into a static method on clojure.lang.Util that takes an Iterator, and call it from both places? The vector version already traverses via the vector's iterator, and a seq's iterator exists and isn't going to be much less efficient than the direct traversal in ASeq.hashCode.]
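Sketched in Clojure for brevity (the real thing would be a static int method in Java; iter-hash is a made-up name, and nothing like it exists in clojure.lang.Util as far as I can see), the shared helper would amount to:

    ;; Hypothetical shared traversal: one copy of the 31*h + hash(x) loop,
    ;; fed by whatever Iterator the caller has. unchecked-int truncates to
    ;; 32 bits, mimicking Java int overflow.
    (defn iter-hash [^java.util.Iterator it]
      (loop [h 1]
        (if (.hasNext it)
          (let [x (.next it)]
            (recur (long (unchecked-int
                           (+ (* 31 h)
                              (if (nil? x) 0 (.hashCode x)))))))
          h)))

APersistentVector.hashCode would pass its own iterator and ASeq.hashCode something like the existing clojure.lang.SeqIterator, leaving the traversal logic in exactly one place.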