On Fri, Aug 7, 2009 at 8:14 PM, John Harrop <jharrop...@gmail.com> wrote:

> Your core loop seems to be:
> (loop [zr (double 0.0)
>        zi (double 0.0)
>        zr2 (double 0.0)
>        zi2 (double 0.0)
>        iterations-remaining iterations-remaining]
>   (if (and (not (neg? iterations-remaining))
>            (< (+ zr2 zi2) limit-square))
>     (let [new-zi (double (+ (* (* f2 zr) zi) pi))
>           new-zr (double (+ (- zr2 zi2) pr))
>           new-zr2 (double (* new-zr new-zr))
>           new-zi2 (double (* new-zi new-zi))]
>       (recur new-zr new-zi new-zr2 new-zi2 (dec iterations-remaining)))
>
> What I suggest is
>
> (loop [zr (double 0.0)
>        zi (double 0.0)
>        i (int (inc iterations-remaining))]
>   (let [zr2 (* zr zr)
>         zi2 (* zi zi)]
>     (if (and (not (= 0 i)) (< (+ zr2 zi2) limit-square))
>       (recur (+ (- zr2 zi2) pr) (+ (* (* f2 zr) zi) pi) (unchecked-dec i))
>       (whatever...))))
>
> * Same calculations
> * Less items in recur rebind
> * Counter is also primitive
>
> (untested, may need slight tweakage)
>

I've got some interesting new benchmark results here.

user=> (time
         (let [m (int 100000000) b (double 4.0) x0 (double -0.2)
               y0 (double 0.1) t (double 2.0)]
           (loop [x (double 0.0) y (double 0.0) i (int 0)]
             (if (< i m)
               (let [x2 (* x x) y2 (* y y)]
                 (if (< (+ x2 y2) b)
                   (recur (+ (- x2 y2) x0) (+ (* t (* x y)) y0) (unchecked-inc i))
                   i))
               i))))
"Elapsed time: 875.36796 msecs"
100000000

A very straightforward version: 875.36796 ms / 100000000 iterations = 8.7536796 ns
per iteration. This is on a 2.5GHz machine, so that's only about 22 native
instructions per iteration. The theoretical minimum is just over half of that,
about 12 instructions:

Four fmuls
Three fadds and an fsub
A floating point comparison
An integer add and an integer comparison
A branch back to the start of the loop

So we're within a factor of 2 of *native* code speeds (nevermind Java) here.
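
Tallying that list at the REPL (my own arithmetic, using my 2.5GHz clock):

user=> (let [insns (+ 4 4 1 2 1)]  ; fmuls, fadds/fsub, fcmp, int add + cmp, branch
         [insns (/ insns 2.5)])    ; instruction count, and ns per iteration at 2.5GHz
[12 4.8]

That 4.8ns floor is where the 5ns minimum figure used further below comes from.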

user=> (time
         (let [m (int 100000000) b (double 4.0) x0 (double -0.2)
               y0 (double 0.1) t (double 2.0)]
           (loop [x (double 0.0) y (double 0.0) i m]
             (if (> i 0)
               (let [x2 (* x x) y2 (* y y)]
                 (if (< (+ x2 y2) b)
                   (recur (+ (- x2 y2) x0) (+ (* t (* x y)) y0) (unchecked-dec i))
                   i))
               i))))
"Elapsed time: 3418.8542 msecs"
0

Shockingly, reversing the count-up into a count-down, without changing anything
else at all, makes things much, MUCH worse: about four times slower.
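
One diagnostic worth trying (my own untested sketch): replace the (> i 0) test
with pos?, which, like zero?, appears to carry a single-argument inline in
clojure.core. If this restores the fast timing, the two-argument > inline is
the culprit:

user=> (time
         (let [m (int 100000000) b (double 4.0) x0 (double -0.2)
               y0 (double 0.1) t (double 2.0)]
           (loop [x (double 0.0) y (double 0.0) i m]
             (if (pos? i)
               (let [x2 (* x x) y2 (* y y)]
                 (if (< (+ x2 y2) b)
                   (recur (+ (- x2 y2) x0) (+ (* t (* x y)) y0) (unchecked-dec i))
                   i))
               i))))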

user=> (time
         (let [m (int 100000000) b (double 4.0) x0 (double -0.2)
               y0 (double 0.1) t (double 2.0)]
           (loop [x (double 0.0) y (double 0.0) i m]
             (if (zero? i)
               i
               (let [x2 (* x x) y2 (* y y)]
                 (if (< (+ x2 y2) b)
                   (recur (+ (- x2 y2) x0) (+ (* t (* x y)) y0) (unchecked-dec i))
                   i))))))
"Elapsed time: 874.9564 msecs"
0

The original, at-least-half-of-C speed again.

It seems that < and zero? are equally fast, while > is much slower, on
primitive ints.

user=> ^#'<
{:ns #<Namespace clojure.core>, :name <, :file "clojure/core.clj", :line
589, :arglists ([x] [x y] [x y & more]), :inline-arities #{2}, :inline
#<core$fn__3363 clojure.core$fn__3...@b655a>, :doc "Returns non-nil if nums
are in monotonically increasing order,\n  otherwise false."}
user=> ^#'zero?
{:ns #<Namespace clojure.core>, :name zero?, :file "clojure/core.clj", :line
742, :arglists ([x]), :inline #<core$fn__3485 clojure.core$fn__3...@7787a5>,
:tag java.lang.Boolean, :doc "Returns true if num is zero, else false"}
user=> ^#'>
{:ns #<Namespace clojure.core>, :name >, :file "clojure/core.clj", :line
617, :arglists ([x] [x y] [x y & more]), :inline-arities #{2}, :inline
#<core$fn__3379 clojure.core$fn__3...@51e67c>, :doc "Returns non-nil if nums
are in monotonically decreasing order,\n  otherwise false."}

It isn't that > isn't inlined, either.

I think the inline implementation #<core$fn__3379
clojure.core$fn__3...@51e67c> of > needs to be investigated to find out why it
is apparently vastly slower than that of <.
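
A minimal way to isolate the comparison itself, with the floating-point work
stripped out (another untested sketch):

user=> (time (loop [i (int 0)] (if (< i 100000000) (recur (unchecked-inc i)) i)))
user=> (time (loop [i (int 100000000)] (if (> i 0) (recur (unchecked-dec i)) i)))
user=> (time (loop [i (int 100000000)] (if (zero? i) i (recur (unchecked-dec i)))))

If the > loop alone shows the 4x penalty, the comparison itself is to blame
rather than some interaction with the rest of the loop body.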

In the meantime, I'm not sure why Andy's code is still 3x slower than Java,
while mine runs at more than half the speed of native C. Perhaps it's a
difference in our JVMs or hardware. I'm using 1.6.0u13 -server on a 2.5GHz
Intel CPU. Andy may be using a different JVM, not using -server, or using a
CPU with fewer registers. (To run as fast as it does, my loop must be running
entirely in CPU registers: there's not enough "slack time" between the 5ns
minimum to execute the calculations at 2.5GHz and the 9ns actual time to make
a round trip to my 7ns RAM. On another CPU architecture with fewer registers,
the loop might not fit in the registers.) The bigger difference between C and
Clojure on Andy's system could then result if the JIT is making the loop less
register-efficient (using, say, one or two more registers) than hand-tuned
assembly or a good C optimizer would, and this is what makes the difference on
hardware without as many registers as my Intel chip. Even a single 7ns memory
fetch per iteration, if not pipelined with computations that don't depend on
it, would significantly degrade the speed of this tight loop on fast hardware.

Another possibility is pipelining differences. On my CPU the loop might be
fetching something from memory every iteration but running some of the
calculations during the fetch, so the fetch appears to take only 4ns instead
of 7; maybe Andy's CPU doesn't overlap them as well. Yet another is that the
fetch is from the L1 cache and takes only 4ns on mine and longer on Andy's (an
L1 cache speed difference). (Again, Andy's C vs. Java speed difference would
then need explaining as the C being more efficient in some way, perhaps in the
registers used.) A difference in Clojure versions also occurs to me: I'm using
1.0; if he's using bleeding-edge and there was a performance regression in one
of the operations used in the loop (or in loop/recur itself), that would do
it. Yet another is that Andy's C code is actually faster than the "theoretical
maximum" by using SIMD instructions (such as SSE2, which handles doubles) that
the JIT fails to use: the (* x x) and (* y y) can probably be parallelized
with SIMD, and the (+ (...) x0) and (+ (...) y0) likewise.

So, differences in hardware (registers, cache speed, pipeline stalls, or
more than one of those), Clojure version, or JVM version could account for
the difference, or the C code using SIMD and the JIT-generated code failing
to do so.

Another way SIMD or other such instructions could matter is if my Java
version's JIT exploits them and Andy's does not (this would be another
JVM-version-associated case). Or, Sun's JIT exploits only one SIMD instruction
set, the C code uses a different one, and Andy's CPU supports only the latter;
since my CPU supports both, that could explain the differences.

It would help if Andy posted the results of evaluating the above three
expressions at his REPL, together with his CPU speed in GHz, or just the
results of using those to calculate the number of instruction cycles needed
per iteration for each of the three expressions.
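
For reference, here's the conversion I have in mind, as a small helper (the
function name is mine, not something in clojure.core):

(defn cycles-per-iter
  "Estimate CPU cycles per loop iteration from a (time ...) result."
  [msecs iters ghz]
  ;; ms -> ns is (* msecs 1e6); ns -> cycles is (* ghz); divide by iterations
  (/ (* msecs 1e6 ghz) iters))

;; e.g. (cycles-per-iter 875.36796 100000000 2.5) gives about 21.9,
;; the "about 22" figure above.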

If the estimated number of instruction cycles is close to 22 for the < and
zero? loops, then either it's the C code that's astonishingly fast, probably
due to SIMD, or Andy's version of the loop differs from mine in some subtle
but performance-impacting way. To distinguish those two cases, Andy's loop can
be timed by wrapping it in "time" and estimating its cycles per iteration; if
that estimate is also close to 22, the C blazes; otherwise, Andy's loop is
somehow slower than mine.

If the estimated number of instruction cycles is much more than 22 (say,
more than 25) then Andy's system (CPU, JVM, Clojure version) is the source
of the performance difference.

Note: the expressions should be run three or four times each. The first two or
three timings will be longer than the later ones (JIT warm-up, presumably).
Run until the times are consistent and at least three repetitions have been
run in rapid succession.
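
One convenient way to do that, using the first expression above (a sketch;
dotimes just repeats the timing back to back):

(dotimes [_ 4]
  (time
    (let [m (int 100000000) b (double 4.0) x0 (double -0.2)
          y0 (double 0.1) t (double 2.0)]
      (loop [x (double 0.0) y (double 0.0) i (int 0)]
        (if (< i m)
          (let [x2 (* x x) y2 (* y y)]
            (if (< (+ x2 y2) b)
              (recur (+ (- x2 y2) x0) (+ (* t (* x y)) y0) (unchecked-inc i))
              i))
          i)))))

The first one or two "Elapsed time" lines should come out longer; the later
ones should settle.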

Note 2: someone with access to disassembly/memory debugger/perf tools might
be able to learn more by running these loops, Andy's Clojure, and Andy's C
on their system. Disassemblies of the JIT-generated code for the Clojure
code and of the compiled C code would be interesting to compare, in
particular.
