I'm working on a program that involves deeply nested data structures. I'm
running into numerous problems working with such structures in Clojure. It
appears that many of Clojure's core functions involving hashing and
equality produce stack overflows with deeply nested data structures, making
it nearly impossible to do anything useful with the large data structures.
I'd like to have a larger discussion about that, but in this post, I want
to start out by focusing on a very specific example that seems like an
easily solvable performance bug.
My understanding is that in the latest version of Clojure, hashes, once
computed, are cached. It is also clear that if two objects have different
hashes, they can't be equal. So I consider the following exchange to be a
bug, because it is clear the differing hashes are being ignored:
; Function to produce deeply nested data structure
(defn make-list
([n] (make-list n (list "a" "a")))
([n a]
(if (zero? n) a
(recur (dec n) (list "a" a)))))
(def a (doall (make-list 7000)))
(def b (doall (make-list 8000)))
=> (hash a)
27780065
=> (hash b)
31748065
=> (= a b)
StackOverflowError clojure.lang.ASeq.equiv (ASeq.java:39)
=> (pst)
StackOverflowError
clojure.lang.ASeq.equiv (ASeq.java:39)
clojure.lang.Util.pcequiv (Util.java:79)
clojure.lang.Util.equiv (Util.java:31)
clojure.lang.ASeq.equiv (ASeq.java:42)
clojure.lang.Util.pcequiv (Util.java:79)
clojure.lang.Util.equiv (Util.java:31)
clojure.lang.ASeq.equiv (ASeq.java:42)
clojure.lang.Util.pcequiv (Util.java:79)
clojure.lang.Util.equiv (Util.java:31)
clojure.lang.ASeq.equiv (ASeq.java:42)
clojure.lang.Util.pcequiv (Util.java:79)
clojure.lang.Util.equiv (Util.java:31)
nil
Some thoughts about this:
My guess is that the current behavior is a design decision based on the
fact that sequences might be infinite, and therefore it isn't necessarily
safe to call hash on them before comparing the sequences for equality.
However, if the hash has already been cached, why not use it?
Of course, solving this particular issue won't really solve the overall
problem:
(= (make-list 7000) (make-list 7000)) will still cause a stack overflow.
(hash (make-list 10000)) will also cause a stack overflow.
I think to solve the overall problem, both hashing and equality testing
will need to be rewritten in a way that doesn't grow the stack.
Even though it doesn't solve the overall problem, I still think that
looking at the cached hash for collection equality would yield meaningful
performance improvements.
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en