On Fri, Aug 7, 2009 at 8:14 PM, John Harrop <jharrop...@gmail.com> wrote:
> Your core loop seems to be:
>
> (loop [zr (double 0.0)
>        zi (double 0.0)
>        zr2 (double 0.0)
>        zi2 (double 0.0)
>        iterations-remaining iterations-remaining]
>   (if (and (not (neg? iterations-remaining))
>            (< (+ zr2 zi2) limit-square))
>     (let [new-zi (double (+ (* (* f2 zr) zi) pi))
>           new-zr (double (+ (- zr2 zi2) pr))
>           new-zr2 (double (* new-zr new-zr))
>           new-zi2 (double (* new-zi new-zi))]
>       (recur new-zr new-zi new-zr2 new-zi2 (dec iterations-remaining)))
>
> What I suggest is
>
> (loop [zr (double 0.0)
>        zi (double 0.0)
>        i (int (inc iterations-remaining))]
>   (let [zr2 (* zr zr)
>         zi2 (* zi zi)]
>     (if (and (not (= 0 i)) (< (+ zr2 zi2) limit-square))
>       (recur (+ (- zr2 zi2) pr) (+ (* (* f2 zr) zi) pi) (unchecked-inc i))
>       (whatever...))))
>
> * Same calculations
> * Less items in recur rebind
> * Counter is also primitive
>
> (untested, may need slight tweakage)

I've got some interesting new benchmark results here.

user=> (time (let [m (int 100000000) b (double 4.0)
                   x0 (double -0.2) y0 (double 0.1) t (double 2.0)]
               (loop [x (double 0.0) y (double 0.0) i (int 0)]
                 (if (< i m)
                   (let [x2 (* x x) y2 (* y y)]
                     (if (< (+ x2 y2) b)
                       (recur (+ (- x2 y2) x0)
                              (+ (* t (* x y)) y0)
                              (unchecked-inc i))
                       i))
                   i))))
"Elapsed time: 875.36796 msecs"
100000000

A very straightforward version, and 875.36796 ms / 100000000 iterations = 8.7536796 ns per iteration. This is on a 2.5 GHz machine, so that's only about 22 clock cycles -- roughly 22 native instructions -- per iteration. The theoretical minimum is over half of that:

Four fmuls
Three fadds and an fsub
A floating point comparison
An integer add and an integer comparison
A branch back to the start of the loop

So we're within a factor of 2 of *native* code speeds (never mind Java) here.

user=> (time (let [m (int 100000000) b (double 4.0)
                   x0 (double -0.2) y0 (double 0.1) t (double 2.0)]
               (loop [x (double 0.0) y (double 0.0) i m]
                 (if (> i 0)
                   (let [x2 (* x x) y2 (* y y)]
                     (if (< (+ x2 y2) b)
                       (recur (+ (- x2 y2) x0)
                              (+ (* t (* x y)) y0)
                              (unchecked-dec i))
                       i))
                   i))))
"Elapsed time: 3418.8542 msecs"
0

Shockingly, reversing the count-up to a count-down, and changing nothing else at all, makes things much, MUCH worse: about four times slower.

user=> (time (let [m (int 100000000) b (double 4.0)
                   x0 (double -0.2) y0 (double 0.1) t (double 2.0)]
               (loop [x (double 0.0) y (double 0.0) i m]
                 (if (zero? i)
                   i
                   (let [x2 (* x x) y2 (* y y)]
                     (if (< (+ x2 y2) b)
                       (recur (+ (- x2 y2) x0)
                              (+ (* t (* x y)) y0)
                              (unchecked-dec i))
                       i))))))
"Elapsed time: 874.9564 msecs"
0

The original, at-least-half-of-C speed again. It seems that < and zero? are as fast as each other, and > is much slower, on primitive ints.

user=> ^#'<
{:ns #<Namespace clojure.core>, :name <, :file "clojure/core.clj", :line 589,
 :arglists ([x] [x y] [x y & more]), :inline-arities #{2},
 :inline #<core$fn__3363 clojure.core$fn__3...@b655a>,
 :doc "Returns non-nil if nums are in monotonically increasing order,\n otherwise false."}

user=> ^#'zero?
{:ns #<Namespace clojure.core>, :name zero?, :file "clojure/core.clj", :line 742,
 :arglists ([x]), :inline #<core$fn__3485 clojure.core$fn__3...@7787a5>,
 :tag java.lang.Boolean,
 :doc "Returns true if num is zero, else false"}

user=> ^#'>
{:ns #<Namespace clojure.core>, :name >, :file "clojure/core.clj", :line 617,
 :arglists ([x] [x y] [x y & more]), :inline-arities #{2},
 :inline #<core$fn__3379 clojure.core$fn__3...@51e67c>,
 :doc "Returns non-nil if nums are in monotonically decreasing order,\n otherwise false."}

It isn't that > isn't inlined, either.
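One experiment worth trying (a sketch only -- I haven't run it, and the extra z binding is purely for illustration): if the slowdown comes from the literal 0 in (> i 0) being treated as a boxed number rather than a primitive in the inline expansion, then comparing against a primitive int local instead should bring the count-down loop back to count-up speed:

(time (let [m (int 100000000) b (double 4.0)
            x0 (double -0.2) y0 (double 0.1) t (double 2.0)
            z (int 0)]                        ; hoist a primitive zero
        (loop [x (double 0.0) y (double 0.0) i m]
          (if (> i z)                         ; compare int against int local
            (let [x2 (* x x) y2 (* y y)]
              (if (< (+ x2 y2) b)
                (recur (+ (- x2 y2) x0)
                       (+ (* t (* x y)) y0)
                       (unchecked-dec i))
                i))
            i))))

If that is no faster, the problem is in the inline expansion of > itself rather than in how the constant operand is handled.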
I think the inline implementation #<core$fn__3379 clojure.core$fn__3...@51e67c> of > needs to be investigated for why it is apparently vastly slower than that of <.

In the meantime, I'm not sure why Andy's code is still 3x slower than Java, while mine runs at more than half the speed of native C. Perhaps it's a difference in our JVMs or hardware. I'm using 1.6.0u13 -server on a 2.5 GHz Intel CPU; if Andy's using a different JVM, not using -server, or using a CPU with fewer registers, that could account for it. (To run as fast as it does, my loop must be running entirely in CPU registers: there's not enough "slack time" between the 5 ns minimum to execute the calculations at 2.5 GHz and the 9 ns actual time per iteration to make a round trip to my 7 ns RAM. On another CPU architecture with fewer registers, the loop might not fit in the registers.) The bigger difference between C and Clojure on Andy's system could then arise if the JIT generates less register-efficient code (using, say, one or two more registers) than hand-tuned assembly or a good C optimizer, and that makes the difference on hardware without as many registers as my Intel chip. Even a single 7 ns memory fetch per iteration, if not pipelined with computations that don't depend on it, would significantly degrade the speed of this tight loop on fast hardware.

Another possibility is pipelining differences. On my CPU the loop might be fetching something from memory every iteration but running some of the calculations during the fetch, so the fetch appears to take only 4 ns instead of 7; maybe Andy's CPU doesn't do as good a job. Yet another is that the fetch is from L1 cache and takes only 4 ns on mine but longer on Andy's (an L1 cache speed difference). (Again, Andy's C vs. Java speed difference would then need explaining by the C being more efficient in some way, perhaps in registers used.)

Yet another thought that occurs to me is a difference in Clojure versions. I'm using 1.0; if he's using bleeding-edge and there was a performance regression in one of the operations used in the loop (or in loop/recur itself), that would do it.

Yet another is that Andy's C code is actually faster than the "theoretical minimum" above would suggest, by using SIMD instructions that the JIT fails to use. (The (* x x) and (* y y) can probably be parallelized with 3DNOW/MMX, and the (+ (...) x0) and (+ (...) y0) likewise.)

So, differences in hardware (registers, cache speed, pipeline stalls, or more than one of those), Clojure version, or JVM version could account for the difference, or the C code could be using SIMD while the JIT-generated code fails to. Another way SIMD or similar instructions could matter is if my JVM's JIT exploits them and Andy's does not (another JVM-version case). Or Sun's JIT exploits only one of 3DNOW and MMX, the C code uses the other, and Andy's CPU supports only the other; since my CPU supports both, that could explain the differences.

It would help if Andy posted the results of evaluating the above three expressions at his REPL, along with his CPU speed in GHz -- or just the results of using those to calculate the number of instruction cycles per iteration for each of the three expressions. If the estimated number of cycles is close to 22 for the < and zero? loops, then either it's the C code that's astonishingly fast, probably due to SIMD, or Andy's version of the loop differs from mine in some subtle but performance-impacting way.
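The arithmetic for that estimate is just (elapsed ms x 10^6 / iterations) x clock GHz; a throwaway helper along these lines (the name is mine, nothing standard) does it at the REPL:

(defn cycles-per-iteration
  "Estimate clock cycles per loop iteration from an elapsed time in
  milliseconds, the iteration count, and the CPU clock in GHz."
  [elapsed-ms iterations ghz]
  (* (/ (* elapsed-ms 1000000.0) iterations) ghz))

;; e.g. (cycles-per-iteration 875.36796 100000000 2.5)  ;=> ~21.9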
To distinguish the latter two cases (astonishingly fast C vs. a subtly different loop), Andy's loop can be timed by wrapping it in "time" and estimating its cycles per iteration; if that estimate is also close to 22, the C simply blazes, otherwise Andy's loop is somehow slower than mine. If the estimated number of cycles is much more than 22 (say, more than 25), then Andy's system (CPU, JVM, or Clojure version) is the source of the performance difference.

Note: the expressions should be run three or four times. The first two or three timings will be longer than the later ones (JIT warm-up, presumably). Run until the times are consistent and at least three repetitions have been done in rapid succession; a small helper macro for this is sketched at the end of this message.

Note 2: someone with access to disassembly/memory debugger/profiling tools might be able to learn more by running these loops, Andy's Clojure, and Andy's C on their system. Disassemblies of the JIT-generated code for the Clojure and of the compiled C would be particularly interesting to compare.
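Re the first note: a quick throwaway macro (again, the name is mine) saves some copy-pasting when repeating a timing until it settles:

(defmacro time-n
  "Evaluate expr n times in a row, printing each run's elapsed time."
  [n expr]
  `(dotimes [_# ~n]
     (time ~expr)))

;; e.g. replace the outer (time ...) in the loops above with (time-n 4 ...)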