Apologies for the length of this message -- I'm trying to be complete, and that 
made it pretty long.

Also BTW most of the tests below were run using Clojure 1.1. If part of the 
answer to my questions is "use 1.2" then I'll upgrade ASAP (but I haven't done 
so yet because I'd prefer to be confused by one thing at a time :-). I don't 
think that can be the full answer, though, since the last batch of runs below 
WERE run under 1.2 and they're also problematic...

Also, for most of the runs described here (with the one exception noted below) 
I am running under Linux:

[lspec...@fly ~]$ cat /proc/version
Linux version 2.6.18-164.6.1.el5 (mockbu...@builder10.centos.org) (gcc version 
4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Nov 3 16:12:36 EST 2009

with this Java version:

[lspec...@fly ~]$ java -version
java version "1.6.0_16"
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)

SO: Most of the documentation and discussion about Clojure concurrency is about 
managing state that may be shared between concurrent processes. I have what I 
guess are more basic questions: how can/should concurrent processes be started 
even in the absence of shared state (or when everything that's shared is 
immutable), and how do I get the most out of concurrency on multiple cores? 

I often have large numbers of relatively long, independent processes and I want 
to farm them out to multiple cores. (For those who care, this is often in the 
context of evolutionary computation systems, with each of the processes being a 
fitness test.) I had thought that I was farming these out to multiple cores in 
the right way, using agents or sometimes just pmap, but then I noticed that my 
runtimes weren't scaling the way I expected across machines with different 
numbers of cores (even though I usually saw near-total utilization of all cores 
in "top").

This led me to do some more systematic testing, and I'm confused/concerned about 
what I'm seeing, so I'm going to present my tests and results here in the hope 
that someone can clear things up for me. I know that timing things in Clojure 
can be complicated, both on account of laziness and on account of optimizations 
that happen on the Java side, but I think I've done the right things to avoid 
getting tripped up too much by these issues. Still, it's quite possible that 
I've coded some things incorrectly and/or that I'm misunderstanding some basic 
concepts, and I'd appreciate any help that anyone can provide.

First I defined a function that would take a non-trivial amount of time to 
execute, as follows:

(defn burn 
  ([] (count 
        (take 1E6 
          (repeatedly 
            #(* 9999999999 9999999999)))))
  ([_] (burn)))

The implementation with an ignored argument just serves to make some of my 
later calls neater -- I suppose I might incur a tiny additional cost when 
calling it that way, but this will be swamped by the things I'm timing.
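
As a quick sanity check, both arities do exactly the same work, so both of 
these calls should return the same count of 1000000:

;; the argument is ignored, so these two calls are equivalent
(burn)
(burn :ignored)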

Then I defined functions for calling this multiple times either sequentially or 
concurrently, using three different techniques for starting the concurrent 
processes:

(defn burn-sequentially [n]
  (print n " sequential burns: ")
  (time (dotimes [i n] (burn))))

(defn burn-via-pmap [n]
  (print n " burns via pmap: ")
  (time (doall (pmap burn (range n)))))

(defn burn-via-futures [n]
  (print n " burns via futures: ")
  (time (doall (pmap deref (map (fn [_] (future (burn)))
                                (range n))))))

(defn burn-via-agents [n]
  (print n " burns via agents: ")
  (time (let [agents (map #(agent %) (range n))]
          (dorun (map #(send % burn) agents))
          (apply await agents))))

Finally, since there's often quite a bit of variability in the run times of 
these things (maybe because of garbage collection? Optimization? I'm not sure), 
I defined a simple macro to execute a call three times:

(defmacro thrice [expression] 
  `(do ~expression ~expression ~expression))

Now I can do some timings, and I'll first show you what happens in one of the 
cases where everything performs as expected.

On a 16-core machine (details at 
http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-1.local), 
running four burns thrice, with the code:

(thrice (burn-sequentially 4))
(thrice (burn-via-pmap 4))
(thrice (burn-via-futures 4))
(thrice (burn-via-agents 4))

I get:

4  sequential burns: "Elapsed time: 2308.616 msecs"
4  sequential burns: "Elapsed time: 1510.207 msecs"
4  sequential burns: "Elapsed time: 1182.743 msecs"
4  burns via pmap: "Elapsed time: 470.988 msecs"
4  burns via pmap: "Elapsed time: 457.015 msecs"
4  burns via pmap: "Elapsed time: 446.84 msecs"
4  burns via futures: "Elapsed time: 417.368 msecs"
4  burns via futures: "Elapsed time: 401.444 msecs"
4  burns via futures: "Elapsed time: 398.786 msecs"
4  burns via agents: "Elapsed time: 421.103 msecs"
4  burns via agents: "Elapsed time: 426.775 msecs"
4  burns via agents: "Elapsed time: 408.416 msecs"

The improvement from the first line to the second is something I always see 
(along with frequent improvements across the three calls in a "thrice"), and I 
assume this is due to optimizations taking place in the JVM. Then we see that 
all of the ways of starting concurrent burns perform about the same, and all 
produce a speedup over the sequential burns somewhere in the neighborhood of 
3x-4x. Pretty much exactly what I would expect and want. So far so good.

However, in the same JVM launch I then went on to do the same thing but with 16 
and then 48 burns in each call:

(thrice (burn-sequentially 16))
(thrice (burn-via-pmap 16))
(thrice (burn-via-futures 16))
(thrice (burn-via-agents 16))

(thrice (burn-sequentially 48))
(thrice (burn-via-pmap 48))
(thrice (burn-via-futures 48))
(thrice (burn-via-agents 48))

This produced:

16  sequential burns: "Elapsed time: 5821.574 msecs"
16  sequential burns: "Elapsed time: 6580.684 msecs"
16  sequential burns: "Elapsed time: 6648.013 msecs"
16  burns via pmap: "Elapsed time: 5953.194 msecs"
16  burns via pmap: "Elapsed time: 7517.196 msecs"
16  burns via pmap: "Elapsed time: 7380.047 msecs"
16  burns via futures: "Elapsed time: 1168.827 msecs"
16  burns via futures: "Elapsed time: 1068.98 msecs"
16  burns via futures: "Elapsed time: 1048.745 msecs"
16  burns via agents: "Elapsed time: 1041.05 msecs"
16  burns via agents: "Elapsed time: 1030.712 msecs"
16  burns via agents: "Elapsed time: 1041.139 msecs"
48  sequential burns: "Elapsed time: 15909.333 msecs"
48  sequential burns: "Elapsed time: 14825.631 msecs"
48  sequential burns: "Elapsed time: 15232.646 msecs"
48  burns via pmap: "Elapsed time: 13586.897 msecs"
48  burns via pmap: "Elapsed time: 3106.56 msecs"
48  burns via pmap: "Elapsed time: 3041.272 msecs"
48  burns via futures: "Elapsed time: 2968.991 msecs"
48  burns via futures: "Elapsed time: 2895.506 msecs"
48  burns via futures: "Elapsed time: 2818.724 msecs"
48  burns via agents: "Elapsed time: 2802.906 msecs"
48  burns via agents: "Elapsed time: 2754.364 msecs"
48  burns via agents: "Elapsed time: 2743.038 msecs"

Looking first at the 16-burn runs, we see that concurrency via pmap is actually 
generally WORSE than sequential. I cannot understand why this should be the 
case. I guess if I were running on a single core I would expect to see a slight 
loss when going to pmap because there would be some cost for managing the 16 
threads that wouldn't be compensated for by actual concurrency. But I'm running 
on 16 cores and I should be getting a major speedup, not a slowdown. There are 
only 16 threads, so there shouldn't be a lot of time lost to overhead.

Also interesting: in this case, when I start the processes using futures or 
agents I DO see a speedup. It's on the order of 6x-7x, not close to the 16x 
that I would hope for, but at least it's a speedup. Why is this so different 
from the case with pmap? (Recall that my pmap-based method DID produce about 
the same speedup as my other methods when doing only 4 burns.)
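
(In case it's relevant for diagnosing this, the number of cores the JVM thinks 
it has can be checked directly; I'd expect the call below to report 16 on this 
machine, but I mention it in case someone suspects the JVM isn't seeing all of 
the cores.)

;; how many processors does the JVM report?
(.availableProcessors (Runtime/getRuntime))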

For the calls with 48 burns we again see roughly the expected, reasonably good 
pattern: all of the concurrent calls perform nearly equivalently (I suppose 
that the steady improvement across the calls is again some kind of JVM 
optimization), with a speedup over the sequential calls in the neighborhood of 
5x-6x. Again, not the ~16x that I might hope for, but at least it's in the 
right direction. The very first of the pmap calls with 48 burns is an anomaly, 
with only a slight improvement over the sequential calls, so I suppose that's 
another small mystery.

The big mystery so far, however, is in the case of the 16 burns via pmap, which 
is bizarrely slow on this 16-core machine.

Next I tried the same thing on a 48-core machine 
(http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-2.local). 
Here I got:

4  sequential burns: "Elapsed time: 3062.871 msecs"
4  sequential burns: "Elapsed time: 2249.048 msecs"
4  sequential burns: "Elapsed time: 2417.677 msecs"
4  burns via pmap: "Elapsed time: 705.968 msecs"
4  burns via pmap: "Elapsed time: 679.865 msecs"
4  burns via pmap: "Elapsed time: 685.017 msecs"
4  burns via futures: "Elapsed time: 687.097 msecs"
4  burns via futures: "Elapsed time: 636.543 msecs"
4  burns via futures: "Elapsed time: 660.116 msecs"
4  burns via agents: "Elapsed time: 708.163 msecs"
4  burns via agents: "Elapsed time: 709.433 msecs"
4  burns via agents: "Elapsed time: 713.536 msecs"
16  sequential burns: "Elapsed time: 8065.446 msecs"
16  sequential burns: "Elapsed time: 8069.239 msecs"
16  sequential burns: "Elapsed time: 8102.791 msecs"
16  burns via pmap: "Elapsed time: 11288.757 msecs"
16  burns via pmap: "Elapsed time: 12182.506 msecs"
16  burns via pmap: "Elapsed time: 14609.397 msecs"
16  burns via futures: "Elapsed time: 2519.603 msecs"
16  burns via futures: "Elapsed time: 2436.699 msecs"
16  burns via futures: "Elapsed time: 2776.869 msecs"
16  burns via agents: "Elapsed time: 2178.028 msecs"
16  burns via agents: "Elapsed time: 2871.38 msecs"
16  burns via agents: "Elapsed time: 2244.687 msecs"
48  sequential burns: "Elapsed time: 24118.218 msecs"
48  sequential burns: "Elapsed time: 24096.667 msecs"
48  sequential burns: "Elapsed time: 24057.327 msecs"
48  burns via pmap: "Elapsed time: 10369.224 msecs"
48  burns via pmap: "Elapsed time: 6837.071 msecs"
48  burns via pmap: "Elapsed time: 4163.926 msecs"
48  burns via futures: "Elapsed time: 3980.298 msecs"
48  burns via futures: "Elapsed time: 4066.35 msecs"
48  burns via futures: "Elapsed time: 4068.199 msecs"
48  burns via agents: "Elapsed time: 4012.069 msecs"
48  burns via agents: "Elapsed time: 4052.759 msecs"
48  burns via agents: "Elapsed time: 4085.018 msecs"

Essentially this is the same picture that I got on the 16-core machine: decent 
(but less than I would like -- only something like 3x-4x) speedups with most 
concurrent methods in most cases, but a bizarre anomaly with 16 burns started 
via pmap, which is again considerably slower than the sequential runs. Why 
should this be? When I run only 4 burns or a full 48 burns the pmap method 
performs decently (that is, at least things get faster than the sequential 
calls), but with 16 burns something very odd happens.

Finally, I ran the same thing on my MacBook Pro (3.06 GHz Intel Core 2 Duo, Mac 
OS X 10.6.4), with Clojure 1.2.0-master-SNAPSHOT under Eclipse/Counterclockwise 
and with a bunch of applications running, so this is probably acting more or 
less like a single-core machine. I got:

4  sequential burns: "Elapsed time: 3487.224 msecs"
4  sequential burns: "Elapsed time: 2327.569 msecs"
4  sequential burns: "Elapsed time: 2137.697 msecs"
4  burns via pmap: "Elapsed time: 12478.725 msecs"
4  burns via pmap: "Elapsed time: 12815.75 msecs"
4  burns via pmap: "Elapsed time: 8464.909 msecs"
4  burns via futures: "Elapsed time: 11494.17 msecs"
4  burns via futures: "Elapsed time: 12365.537 msecs"
4  burns via futures: "Elapsed time: 12098.571 msecs"
4  burns via agents: "Elapsed time: 10361.749 msecs"
4  burns via agents: "Elapsed time: 12458.174 msecs"
4  burns via agents: "Elapsed time: 9016.093 msecs"
16  sequential burns: "Elapsed time: 8706.674 msecs"
16  sequential burns: "Elapsed time: 8748.006 msecs"
16  sequential burns: "Elapsed time: 8729.54 msecs"
16  burns via pmap: "Elapsed time: 46022.281 msecs"
16  burns via pmap: "Elapsed time: 44845.725 msecs"
16  burns via pmap: "Elapsed time: 45393.156 msecs"
16  burns via futures: "Elapsed time: 52822.863 msecs"
16  burns via futures: "Elapsed time: 50647.708 msecs"
16  burns via futures: "Elapsed time: 50337.916 msecs"
16  burns via agents: "Elapsed time: 48615.905 msecs"
16  burns via agents: "Elapsed time: 56703.723 msecs"
16  burns via agents: "Elapsed time: 69765.913 msecs"
48  sequential burns: "Elapsed time: 38885.616 msecs"
48  sequential burns: "Elapsed time: 38651.573 msecs"
48  sequential burns: "Elapsed time: 36669.02 msecs"
48  burns via pmap: "Elapsed time: 169108.022 msecs"
48  burns via pmap: "Elapsed time: 176656.455 msecs"
48  burns via pmap: "Elapsed time: 182119.986 msecs"
48  burns via futures: "Elapsed time: 176764.722 msecs"
48  burns via futures: "Elapsed time: 169257.577 msecs"
48  burns via futures: "Elapsed time: 157205.693 msecs"
48  burns via agents: "Elapsed time: 140618.333 msecs"
48  burns via agents: "Elapsed time: 137992.773 msecs"
48  burns via agents: "Elapsed time: 143153.696 msecs"

Here we have a very depressing picture. Although I wouldn't expect to get any 
speedup from concurrency here, the concurrency-related slowdowns have now 
spread to all of my concurrency-starting methods with all numbers of burns. It 
is way, way worse to use the concurrency methods than the straightforward 
sequential method in every circumstance. Again, I understand why one should 
expect a small loss in a case like this, but these are huge losses, and the 
number of threads that have to be coordinated (with no shared state) is quite 
small -- just 4-48.

My guess is that all of this is stemming from some confusion on my part about 
how I should be starting and managing concurrent processes, and my greatest 
hope is that one of you will show me an alternative to my burn-via-* functions 
that provides a speedup nearly linear with the number of cores and only a 
negligible loss when there's only one core available...
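
To be concrete about the kind of alternative I have in mind, here's a sketch 
using a plain java.util.concurrent fixed-size thread pool sized to the number 
of cores. I haven't verified that this behaves any better than my functions 
above (the name and structure are just illustrative); it's only meant to show 
the sort of thing I'd be happy with:

(import '(java.util.concurrent Executors))

(defn burn-via-executor [n]
  (print n " burns via executor: ")
  (time
    (let [pool    (Executors/newFixedThreadPool
                    (.availableProcessors (Runtime/getRuntime)))
          ;; Clojure fns implement java.util.concurrent.Callable,
          ;; so they can be handed to the pool directly
          tasks   (vec (repeat n (fn [] (burn))))
          futures (.invokeAll pool tasks)]  ; blocks until all tasks finish
      (doseq [f futures] (.get f))          ; surface any exceptions
      (.shutdown pool))))

The idea is the same as my futures version, except that the pool size is 
pinned to the core count.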

But any help of any kind would be appreciated.

Thanks,

 -Lee

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438

Check out Genetic Programming and Evolvable Machines:
http://www.springer.com/10710 - http://gpemjournal.blogspot.com/
