In Clojure 1.1.0 (which is what I have running on the big machines) I get a 
warning and then an error from your ^Callable line:

WARNING: reader macro ^ is deprecated; use meta instead
Exception in thread "main" java.lang.IllegalArgumentException: let requires an 
even number of forms in binding vector (concur.clj:42)

What's the right way to patch that?
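(My guess, which I haven't tested: before 1.2 the metadata reader macro was 
spelled #^ rather than ^, so perhaps under 1.1 the binding should read

(let [#^Callable func (fn [] (burn))]
  ...)

but I'd like to confirm that's the right fix.)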

 -Lee


On Aug 4, 2010, at 2:08 PM, Armando Blancas wrote:

> What about a more direct way of creating your threads. This code is
> too simple and more is needed to collect results with futures, but I
> wonder how something like this would perform on your machine:
> 
> (defn burn-via-pool [n]
>   (print n " burns via a thread pool: ")
>   (time
>     (let [cores (.. Runtime getRuntime availableProcessors)
>           pool (java.util.concurrent.Executors/newFixedThreadPool cores)
>           ^Callable func (fn [] (burn))]
>       (dotimes [_ n] (.submit pool func))
>       (.shutdown pool)
>       (.awaitTermination pool 1 java.util.concurrent.TimeUnit/HOURS))))
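> 
> To collect results, the same pool could hang onto the futures that
> submit returns and then get them back out, something along these lines
> (untested sketch; burn-via-pool-results is just a made-up name):
> 
> (defn burn-via-pool-results [n]
>   (let [cores (.. Runtime getRuntime availableProcessors)
>         pool (java.util.concurrent.Executors/newFixedThreadPool cores)
>         ^Callable func (fn [] (burn))
>         futs (doall (for [_ (range n)] (.submit pool func)))
>         results (doall (map #(.get %) futs))]
>     (.shutdown pool)
>     results))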
> 
> On Aug 4, 7:36 am, Lee Spector <lspec...@hampshire.edu> wrote:
>> Apologies for the length of this message -- I'm hoping to be complete, but 
>> that made the message pretty long.
>> 
>> Also BTW most of the tests below were run using Clojure 1.1. If part of the 
>> answer to my questions is "use 1.2" then I'll upgrade ASAP (but I haven't 
>> done so yet because I'd prefer to be confused by one thing at a time :-). I 
>> don't think that can be the full answer, though, since the last batch of 
>> runs below WERE run under 1.2 and they're also problematic...
>> 
>> Also, for most of the runs described here (with the one exception noted 
>> below) I am running under Linux:
>> 
>> [lspec...@fly ~]$ cat /proc/version
>> Linux version 2.6.18-164.6.1.el5 (mockbu...@builder10.centos.org) (gcc 
>> version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Nov 3 16:12:36 EST 2009
>> 
>> with this Java version:
>> 
>> [lspec...@fly ~]$ java -version
>> java version "1.6.0_16"
>> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
>> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>> 
>> SO: Most of the documentation and discussion about clojure concurrency is 
>> about managing state that may be shared between concurrent processes, but I 
>> have what I guess are more basic questions about how concurrent processes 
>> can/should be started even in the absence of shared state (or when all 
>> that's shared is immutable) and about how to get the most out of concurrency 
>> on multiple cores.
>> 
>> I often have large numbers of relatively long, independent processes and I 
>> want to farm them out to multiple cores. (For those who care this is often 
>> in the context of evolutionary computation systems, with each of the 
>> processes being a fitness test.) I had thought that I was farming these out 
>> in the right way to multiple cores, using agents or sometimes just pmap, but 
>> then I noticed that my runtimes weren't scaling in the way that I expected 
>> across machines with different numbers of cores (even though I usually saw 
>> near total utilization of all cores in "top").
>> 
>> This led me to do some more systematic testing and I'm confused/concerned 
>> about what I'm seeing, so I'm going to present my tests and results here in 
>> the hope that someone can clear things up for me. I know that timing things 
>> in clojure can be complicated both on account of laziness and on account of 
>> optimizations that happen on the Java side, but I think I've done the right 
>> things to avoid getting tripped up too much by these issues. Still, it's 
>> quite possible that I've coded some things incorrectly and/or that I'm 
>> misunderstanding some basic concepts, and I'd appreciate any help that 
>> anyone can provide.
>> 
>> First I defined a function that would take a non-trivial amount of time to 
>> execute, as follows:
>> 
>> (defn burn
>>   ([] (count
>>         (take 1E6
>>           (repeatedly
>>             #(* 9999999999 9999999999)))))
>>   ([_] (burn)))
>> 
>> The implementation with an ignored argument just serves to make some of my 
>> later calls neater -- I suppose I might incur a tiny additional cost when 
>> calling it that way but this will be swamped by the things I'm timing.
>> 
>> Then I defined functions for calling this multiple times either sequentially 
>> or concurrently, using three different techniques for starting the 
>> concurrent processes:
>> 
>> (defn burn-sequentially [n]
>>   (print n " sequential burns: ")
>>   (time (dotimes [i n] (burn))))
>> 
>> (defn burn-via-pmap [n]
>>   (print n " burns via pmap: ")
>>   (time (doall (pmap burn (range n)))))
>> 
>> (defn burn-via-futures [n]
>>   (print n " burns via futures: ")
>>   (time (doall (pmap deref (map (fn [_] (future (burn)))
>>                                 (range n))))))
>> 
>> (defn burn-via-agents [n]
>>   (print n " burns via agents: ")
>>   (time (let [agents (map #(agent %) (range n))]
>>           (dorun (map #(send % burn) agents))
>>           (apply await agents))))
>> 
>> Finally, since there's often quite a bit of variability in the run time of 
>> these things (maybe because of garbage collection? Optimization? I'm not 
>> sure), I define a simple macro to execute a call three times:
>> 
>> (defmacro thrice [expression]
>>   `(do ~expression ~expression ~expression))
>> 
>> Now I can do some timings, and I'll first show you what happens in one of 
>> the cases where everything performs as expected.
>> 
>> On a 16-core machine (details at 
>> http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-1.l...), 
>> running four burns thrice, with the code:
>> 
>> (thrice (burn-sequentially 4))
>> (thrice (burn-via-pmap 4))
>> (thrice (burn-via-futures 4))
>> (thrice (burn-via-agents 4))
>> 
>> I get:
>> 
>> 4  sequential burns: "Elapsed time: 2308.616 msecs"
>> 4  sequential burns: "Elapsed time: 1510.207 msecs"
>> 4  sequential burns: "Elapsed time: 1182.743 msecs"
>> 4  burns via pmap: "Elapsed time: 470.988 msecs"
>> 4  burns via pmap: "Elapsed time: 457.015 msecs"
>> 4  burns via pmap: "Elapsed time: 446.84 msecs"
>> 4  burns via futures: "Elapsed time: 417.368 msecs"
>> 4  burns via futures: "Elapsed time: 401.444 msecs"
>> 4  burns via futures: "Elapsed time: 398.786 msecs"
>> 4  burns via agents: "Elapsed time: 421.103 msecs"
>> 4  burns via agents: "Elapsed time: 426.775 msecs"
>> 4  burns via agents: "Elapsed time: 408.416 msecs"
>> 
>> The improvement from the first line to the second is something I always see 
>> (along with frequent improvements across the three calls in a "thrice"), and 
>> I assume this is due to optimizations taking place in the JVM. Then we see
>> that all of the ways of starting concurrent burns perform about the same, 
>> and all produce a speedup over the sequential burns of somewhere in the 
>> neighborhood of 3x-4x. Pretty much exactly what I would expect and want. So 
>> far so good.
>> 
>> However, in the same JVM launch I then went on to do the same thing but with 
>> 16 and then 48 burns in each call:
>> 
>> (thrice (burn-sequentially 16))
>> (thrice (burn-via-pmap 16))
>> (thrice (burn-via-futures 16))
>> (thrice (burn-via-agents 16))
>> 
>> (thrice (burn-sequentially 48))
>> (thrice (burn-via-pmap 48))
>> (thrice (burn-via-futures 48))
>> (thrice (burn-via-agents 48))
>> 
>> This produced:
>> 
>> 16  sequential burns: "Elapsed time: 5821.574 msecs"
>> 16  sequential burns: "Elapsed time: 6580.684 msecs"
>> 16  sequential burns: "Elapsed time: 6648.013 msecs"
>> 16  burns via pmap: "Elapsed time: 5953.194 msecs"
>> 16  burns via pmap: "Elapsed time: 7517.196 msecs"
>> 16  burns via pmap: "Elapsed time: 7380.047 msecs"
>> 16  burns via futures: "Elapsed time: 1168.827 msecs"
>> 16  burns via futures: "Elapsed time: 1068.98 msecs"
>> 16  burns via futures: "Elapsed time: 1048.745 msecs"
>> 16  burns via agents: "Elapsed time: 1041.05 msecs"
>> 16  burns via agents: "Elapsed time: 1030.712 msecs"
>> 16  burns via agents: "Elapsed time: 1041.139 msecs"
>> 48  sequential burns: "Elapsed time: 15909.333 msecs"
>> 48  sequential burns: "Elapsed time: 14825.631 msecs"
>> 48  sequential burns: "Elapsed time: 15232.646 msecs"
>> 48  burns via pmap: "Elapsed time: 13586.897 msecs"
>> 48  burns via pmap: "Elapsed time: 3106.56 msecs"
>> 48  burns via pmap: "Elapsed time: 3041.272 msecs"
>> 48  burns via futures: "Elapsed time: 2968.991 msecs"
>> 48  burns via futures: "Elapsed time: 2895.506 msecs"
>> 48  burns via futures: "Elapsed time: 2818.724 msecs"
>> 48  burns via agents: "Elapsed time: 2802.906 msecs"
>> 48  burns via agents: "Elapsed time: 2754.364 msecs"
>> 48  burns via agents: "Elapsed time: 2743.038 msecs"
>> 
>> Looking first at the 16-burn runs, we see that concurrency via pmap is 
>> actually generally WORSE than sequential. I cannot understand why this 
>> should be the case. I guess if I were running on a single core I would 
>> expect to see a slight loss when going to pmap because there would be some 
>> cost for managing the 16 threads that wouldn't be compensated for by actual 
>> concurrency. But I'm running on 16 cores and I should be getting a major 
>> speedup, not a slowdown. There are only 16 threads, so there shouldn't be a 
>> lot of time lost to overhead.
>> 
>> Also interesting: in this case, when I start the processes using futures or 
>> agents I DO see a speedup. It's on the order of 6x-7x, not close to the 16x 
>> that I would hope for, but at least it's a speedup. Why is this so different 
>> from the case with pmap? (Recall that my pmap-based method DID produce about 
>> the same speedup as my other methods when doing only 4 burns.)
>> 
>> For the calls with 48 burns we again see nearly the expected, reasonably 
>> good pattern with all concurrent calls performing nearly equivalently (I 
>> suppose that the steady improvement over all of the calls is again some kind 
>> of JVM optimization), with a speedup in the concurrent calls over the 
>> sequential calls in the neighborhood of 5x-6x. Again, not the ~16x that I 
>> might hope for, but at least it's in the right direction. The very first of 
>> the pmap calls with 48 burns is an anomaly, with only a slight improvement 
>> over the sequential calls, so I suppose that's another small mystery.
>> 
>> The big mystery so far, however, is in the case of the 16 burns via pmap, 
>> which is bizarrely slow on this 16-core machine.
>> 
>> Next I tried the same thing on a 48-core machine 
>> (http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-2.l...). 
>> Here I got:
>> 
>> 4  sequential burns: "Elapsed time: 3062.871 msecs"
>> 4  sequential burns: "Elapsed time: 2249.048 msecs"
>> 4  sequential burns: "Elapsed time: 2417.677 msecs"
>> 4  burns via pmap: "Elapsed time: 705.968 msecs"
>> 4  burns via pmap: "Elapsed time: 679.865 msecs"
>> 4  burns via pmap: "Elapsed time: 685.017 msecs"
>> 4  burns via futures: "Elapsed time: 687.097 msecs"
>> 4  burns via futures: "Elapsed time: 636.543 msecs"
>> 4  burns via futures: "Elapsed time: 660.116 msecs"
>> 4  burns via agents: "Elapsed time: 708.163 msecs"
>> 4  burns via agents: "Elapsed time: 709.433 msecs"
>> 4  burns via agents: "Elapsed time: 713.536 msecs"
>> 16  sequential burns: "Elapsed time: 8065.446 msecs"
>> 16  sequential burns: "Elapsed time: 8069.239 msecs"
>> 16  sequential burns: "Elapsed time: 8102.791 msecs"
>> 16  burns via pmap: "Elapsed time: 11288.757 msecs"
>> 16  burns via pmap: "Elapsed time: 12182.506 msecs"
>> 16  burns via pmap: "Elapsed time: 14609.397 msecs"
>> 16  burns via futures: "Elapsed time: 2519.603 msecs"
>> 16  burns via futures: "Elapsed time: 2436.699 msecs"
>> 16  burns via futures: "Elapsed time: 2776.869 msecs"
>> 16  burns via agents: "Elapsed time:
>> ...
>> 
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438

Check out Genetic Programming and Evolvable Machines:
http://www.springer.com/10710 - http://gpemjournal.blogspot.com/
