Something got me thinking about the hashes of sequences, and I
realized there would be an efficiency tradeoff there. The hash ought
to depend on at least some of the head elements of the seq, or hash
lookup efficiency goes to hell when seqs are used as map keys or in
sets. The more elements of the seq are used, the better the hash gets,
but with diminishing returns. On the other hand, the more elements of
the seq are used, the more elements of a *lazy* seq must be *realized*
if it is used as a key or placed in a set.
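
To illustrate the tradeoff, here is what a bounded-prefix hash might
look like in Java (purely a hypothetical sketch of the idea, not
anything Clojure actually does; the name prefixHash and the cutoff k
are my inventions):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PrefixHash {
    // Hash at most the first k elements, using the same running
    // h = 31*h + e.hashCode() scheme as java.util.List.hashCode.
    // Seqs sharing a k-element prefix collide, but at most k lazy
    // elements are ever realized just to compute the hash.
    static int prefixHash(Iterable<?> xs, int k) {
        int h = 1;
        Iterator<?> it = xs.iterator();
        for (int i = 0; i < k && it.hasNext(); i++) {
            Object e = it.next();
            h = 31 * h + (e == null ? 0 : e.hashCode());
        }
        return h;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> b = Arrays.asList(1, 2, 3, 9, 9);
        // Same 3-element prefix, so the bounded hashes collide...
        System.out.println(prefixHash(a, 3) == prefixHash(b, 3)); // true
        // ...but a larger bound separates them.
        System.out.println(prefixHash(a, 5) == prefixHash(b, 5)); // false
    }
}
```

The cutoff k is exactly the knob described above: larger k means fewer
collisions but more forced realization of lazy elements.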

This made me wonder where the tradeoff was, so I wrote this pair of
functions to play warmer-colder with:

(defn q1 [n x] (concat (range n) [x]))
(defn q2 [n] (= (.hashCode (q1 n "foo")) (.hashCode (q1 n "bar"))))

I figured the minimum n needed to get a true return from q2 would be a
small integer, but the result was rather surprising:

user=> (q2 50000)
false

It seems to be using potentially all of the elements. That seems
rather inefficient: for commonplace data a few elements will be enough
to get rid of most hash collisions, and in the absence of a collision
an equality test (which needs all the elements up to the first
difference) can be avoided until lookup time, deferring the
realization of potentially very many lazy sequence elements. What's
the rationale here? Are hash collisions caused by differing seqs with
long identical prefixes expected to be commonplace? With strings, long
identical prefixes are frequent (e.g. file paths all in some deeply
nested directory, or URLs with similar structure), but there "long"
tends to mean "several tens of characters, occasionally even more than
100", not fifty thousand or more...
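
For concreteness, the algorithm in question appears to be the one the
java.util.List.hashCode contract specifies (the seq and vector hash
values shown further down in this post match it exactly). It folds in
every element, so a lazy seq hashed this way gets fully realized:

```java
import java.util.Arrays;
import java.util.List;

public class SeqHash {
    // The java.util.List.hashCode contract: start at 1, then fold in
    // every element with h = 31*h + e.hashCode().  Computing this
    // over a lazy seq must realize *all* of its elements.
    static int listHash(Iterable<?> xs) {
        int h = 1;
        for (Object e : xs)
            h = 31 * h + (e == null ? 0 : e.hashCode());
        return h;
    }

    public static void main(String[] args) {
        List<Character> foo = Arrays.asList('f', 'o', 'o');
        // Matches what (.hashCode (seq "foo")) returns in Clojure:
        System.out.println(listHash(foo));   // 131365
        System.out.println(foo.hashCode());  // 131365
    }
}
```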

I also noticed that although using all of the elements (if that is
indeed what it does) permits making the hash consistent with String's
(i.e. (seq "foo") and "foo" would have the same hash) and with the
hashes of vectors, we actually have:

user=> (map #(.hashCode %) ["foo" (seq "foo") [\f \o \o]])
(101574 131365 131365)

Vectors and their seqs do indeed appear to have the same hash. (From
this one data point alone we already have 10^5-to-one odds against
their differing: if the algorithms were different, then the two
identical numbers in the output above, given their order of magnitude,
would be a 10^5-to-one-against coincidence *or worse* [more below].)
However, strings turn out not to share a hash code with their own
seqs.

Vectors and seqs with the same content in the same order also compare
equal, but a string and its seq don't. (Fortunately, given that they
have different hashes.)
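
The discrepancy is easy to pin down: String.hashCode starts its
accumulator at 0, while java.util.List.hashCode (and, by the numbers
above, Clojure's seq hashing) starts at 1; otherwise both use the same
h = 31*h + element recurrence. A quick Java check reproduces the two
numbers from the REPL output:

```java
import java.util.Arrays;

public class StringVsSeqHash {
    public static void main(String[] args) {
        // String.hashCode: h = 0, then h = 31*h + s.charAt(i)
        System.out.println("foo".hashCode());                        // 101574
        // List.hashCode: h = 1, then h = 31*h + e.hashCode()
        System.out.println(Arrays.asList('f', 'o', 'o').hashCode()); // 131365
    }
}
```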

[The exact argument for the probabilities uses Bayes's Theorem: P(A|B)
= P(B|A)P(A)/P(B). If A is "the algorithms are different" and B is
"the hashes are identical", P(B|A) is no more than about 0.00001 if
the hashes are fairly uniform on a range spanning at least from 1 to
the observed numbers. A good hash function *should* be fairly uniform
and the actual range of Java hashes spans from -2,147,483,648 to
2,147,483,647, so 0.00001 is quite conservative. Suppose we accept a
prior probability of 0.5 for A. Then P(B) would have to be 0.5 +
0.5P(B|A) ~ 0.5, since P(B|A) is very small, and P(A|B) ~ P(B|A) is
also very small. So if we started out with a coin-toss guess as to
whether the null hypothesis A was correct, we'd now be rejecting it
with a pretty strong confidence! Even starting with a "likely"
probability of 0.9 for A gives P(B) = 0.1 + 0.9P(B|A) ~ 0.1 and
P(A)/P(B) ~ 9, and we still end up rejecting A with a lot of
confidence, as 0.00009 is still a very small number. Now, strictly
speaking, we'd have to have chosen the string randomly and uniformly
on some large space of strings to call this number anything close to
scientific, but I'd say it's still a pretty good bet that ... oh why
not just check it: <digs up source code> Seems that APersistentVector
and ASeq implement hashCode separately, but do indeed use the same
algorithm. It also looks like ASeq's implementation indeed doesn't
stop until (and unless!) it runs out of sequence to consume. The
duplication of the
algorithm is presumably for efficiency versus APersistentVector's
using _hash = seq().hashCode(); instead when it computes its hash,
which would wrap the traversal in an extra layer of method calls and
create temporary objects for every element traversed. Still, *why* not
just externalize the algorithm to a static method of clojure.lang.Util
that takes an Iterator and call that from both places? The vector
version already uses the vector's iterator and meanwhile the seq
iterator exists and isn't going to be much less efficient than the
direct traversal in ASeq.hashCode.]
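
Concretely, the shared helper I have in mind would be something like
the following (a sketch only; the name hashOrdered and its placement
on a Util class are hypothetical, not actual clojure.lang API):

```java
import java.util.Arrays;

public class Util {
    // Hypothetical shared helper: the List.hashCode algorithm over
    // any Iterator.  Both APersistentVector.hashCode and
    // ASeq.hashCode could delegate here instead of each duplicating
    // the traversal.
    static int hashOrdered(java.util.Iterator<?> it) {
        int h = 1;
        while (it.hasNext()) {
            Object e = it.next();
            h = 31 * h + (e == null ? 0 : e.hashCode());
        }
        return h;
    }

    public static void main(String[] args) {
        // Agrees with java.util.List.hashCode on the same elements:
        System.out.println(hashOrdered(
            Arrays.asList('f', 'o', 'o').iterator())); // 131365
    }
}
```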

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com