2010/8/4 David Nolen <dnolen.li...@gmail.com>

> Have you considered that you're realizing very large lazy sequences and
> might be thrashing around in GC? The parallel versions need X times the
> available memory of the sequential version, where X is the number of
> concurrent threads, right?


David, I don't think so; the burn function does not appear to hold onto the
head.

There is indeed a "potential" problem in the pmap version, since it holds onto
the head of an n-sized sequence, but since n only ranges from 4 to 48 here it
can hardly be the problem.

Lee, I don't have a general answer, but as a related note, I think there
may be a problem with the "with futures" version:

a. you quickly "bootstrap" all futures in the inner call to map
b. you collect the results of the futures in parallel in the outer call to
pmap

as a reminder:
(defn burn-via-futures [n]
  (print n " burns via futures: ")
  (time (doall (pmap deref (map (fn [_] (future (burn)))
                                (range n))))))

problem #1: since map is lazy, the futures will only be created as pmap
consumes the seq (modulo chunked-seq behavior), not all up front. To
bootstrap all of the futures before handing the seq to pmap, you should wrap
the (map ...) call in a doall, e.g. as sketched below.
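
A minimal sketch of what I mean (untested; the -eager name is just mine):

(defn burn-via-futures-eager [n]
  (print n " burns via eager futures: ")
  ;; the inner doall forces the map, so every (future (burn)) is
  ;; submitted to the thread pool before pmap starts consuming the seq
  (time (doall (pmap deref (doall (map (fn [_] (future (burn)))
                                       (range n)))))))
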
problem #2: deref is probably such a quick operation that using pmap to call
it does not buy you anything (it might start to make sense if the number of
cores were really big, e.g. if the coll had 1,000,000 elements and the number
of cores were of the same order of magnitude). A plain map should do; see the
sketch below.
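
Something like this (again just a sketch; the name is mine):

(defn burn-via-futures-simple [n]
  (print n " burns via futures (map deref): ")
  ;; a plain map is enough here: deref just blocks until each future
  ;; finishes, and the parallelism comes from the futures themselves
  (time (doall (map deref (doall (map (fn [_] (future (burn)))
                                      (range n)))))))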



> David
>
>
> On Wed, Aug 4, 2010 at 10:36 AM, Lee Spector <lspec...@hampshire.edu> wrote:
>
>>
>> Apologies for the length of this message -- I'm hoping to be complete, but
>> that made the message pretty long.
>>
>> Also BTW most of the tests below were run using Clojure 1.1. If part of
>> the answer to my questions is "use 1.2" then I'll upgrade ASAP (but I
>> haven't done so yet because I'd prefer to be confused by one thing at a time
>> :-). I don't think that can be the full answer, though, since the last batch
>> of runs below WERE run under 1.2 and they're also problematic...
>>
>> Also, for most of the runs described here (with the one exception noted
>> below) I am running under Linux:
>>
>> [lspec...@fly ~]$ cat /proc/version
>> Linux version 2.6.18-164.6.1.el5 (mockbu...@builder10.centos.org) (gcc
>> version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Nov 3 16:12:36 EST
>> 2009
>>
>> with this Java version:
>>
>> [lspec...@fly ~]$ java -version
>> java version "1.6.0_16"
>> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
>> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>>
>> SO: Most of the documentation and discussion about clojure concurrency is
>> about managing state that may be shared between concurrent processes, but I
>> have what I guess are more basic questions about how concurrent processes
>> can/should be started even in the absence of shared state (or when all
>> that's shared is immutable) and about how to get the most out of concurrency
>> on multiple cores.
>>
>> I often have large numbers of relatively long, independent processes and I
>> want to farm them out to multiple cores. (For those who care this is often
>> in the context of evolutionary computation systems, with each of the
>> processes being a fitness test.) I had thought that I was farming these out
>> in the right way to multiple cores, using agents or sometimes just pmap, but
>> then I noticed that my runtimes weren't scaling in the way that I expected
>> across machines with different numbers of cores (even though I usually saw
>> near total utilization of all cores in "top").
>>
>> This led me to do some more systematic testing and I'm confused/concerned
>> about what I'm seeing, so I'm going to present my tests and results here in
>> the hope that someone can clear things up for me. I know that timing things
>> in clojure can be complicated both on account of laziness and on account of
>> optimizations that happen on the Java side, but I think I've done the right
>> things to avoid getting tripped up too much by these issues. Still, it's
>> quite possible that I've coded some things incorrectly and/or that I'm
>> misunderstanding some basic concepts, and I'd appreciate any help that
>> anyone can provide.
>>
>> First I defined a function that would take a non-trivial amount of time to
>> execute, as follows:
>>
>> (defn burn
>>  ([] (count
>>        (take 1E6
>>          (repeatedly
>>            #(* 9999999999 9999999999)))))
>>  ([_] (burn)))
>>
>> The implementation with an ignored argument just serves to make some of my
>> later calls neater -- I suppose I might incur a tiny additional cost when
>> calling it that way but this will be swamped by the things I'm timing.
>>
>> Then I defined functions for calling this multiple times either
>> sequentially or concurrently, using three different techniques for starting
>> the concurrent processes:
>>
>> (defn burn-sequentially [n]
>>  (print n " sequential burns: ")
>>  (time (dotimes [i n] (burn))))
>>
>> (defn burn-via-pmap [n]
>>  (print n " burns via pmap: ")
>>  (time (doall (pmap burn (range n)))))
>>
>> (defn burn-via-futures [n]
>>  (print n " burns via futures: ")
>>  (time (doall (pmap deref (map (fn [_] (future (burn)))
>>                                (range n))))))
>>
>> (defn burn-via-agents [n]
>>  (print n " burns via agents: ")
>>  (time (let [agents (map #(agent %) (range n))]
>>          (dorun (map #(send % burn) agents))
>>          (apply await agents))))
>>
>> Finally, since there's often quite a bit of variability in the run time of
>> these things (maybe because of garbage collection? Optimization? I'm not
>> sure), I define a simple macro to execute a call three times:
>>
>> (defmacro thrice [expression]
>>  `(do ~expression ~expression ~expression))
>>
>> Now I can do some timings, and I'll first show you what happens in one of
>> the cases where everything performs as expected.
>>
>> On a 16-core machine (details at
>> http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-1.local),
>> running four burns thrice, with the code:
>>
>> (thrice (burn-sequentially 4))
>> (thrice (burn-via-pmap 4))
>> (thrice (burn-via-futures 4))
>> (thrice (burn-via-agents 4))
>>
>> I get:
>>
>> 4  sequential burns: "Elapsed time: 2308.616 msecs"
>> 4  sequential burns: "Elapsed time: 1510.207 msecs"
>> 4  sequential burns: "Elapsed time: 1182.743 msecs"
>> 4  burns via pmap: "Elapsed time: 470.988 msecs"
>> 4  burns via pmap: "Elapsed time: 457.015 msecs"
>> 4  burns via pmap: "Elapsed time: 446.84 msecs"
>> 4  burns via futures: "Elapsed time: 417.368 msecs"
>> 4  burns via futures: "Elapsed time: 401.444 msecs"
>> 4  burns via futures: "Elapsed time: 398.786 msecs"
>> 4  burns via agents: "Elapsed time: 421.103 msecs"
>> 4  burns via agents: "Elapsed time: 426.775 msecs"
>> 4  burns via agents: "Elapsed time: 408.416 msecs"
>>
>> The improvement from the first line to the second is something I always
>> see (along with frequent improvements across the three calls in a "thrice"),
>> and I assume this is due to optimizations taking place in the JVM. Then we
>> see that all of the ways of starting concurrent burns perform about the
>> same, and all produce a speedup over the sequential burns of somewhere in
>> the neighborhood of 3x-4x. Pretty much exactly what I would expect and want.
>> So far so good.
>>
>> However, in the same JVM launch I then went on to do the same thing but
>> with 16 and then 48 burns in each call:
>>
>> (thrice (burn-sequentially 16))
>> (thrice (burn-via-pmap 16))
>> (thrice (burn-via-futures 16))
>> (thrice (burn-via-agents 16))
>>
>> (thrice (burn-sequentially 48))
>> (thrice (burn-via-pmap 48))
>> (thrice (burn-via-futures 48))
>> (thrice (burn-via-agents 48))
>>
>> This produced:
>>
>> 16  sequential burns: "Elapsed time: 5821.574 msecs"
>> 16  sequential burns: "Elapsed time: 6580.684 msecs"
>> 16  sequential burns: "Elapsed time: 6648.013 msecs"
>> 16  burns via pmap: "Elapsed time: 5953.194 msecs"
>> 16  burns via pmap: "Elapsed time: 7517.196 msecs"
>> 16  burns via pmap: "Elapsed time: 7380.047 msecs"
>> 16  burns via futures: "Elapsed time: 1168.827 msecs"
>> 16  burns via futures: "Elapsed time: 1068.98 msecs"
>> 16  burns via futures: "Elapsed time: 1048.745 msecs"
>> 16  burns via agents: "Elapsed time: 1041.05 msecs"
>> 16  burns via agents: "Elapsed time: 1030.712 msecs"
>> 16  burns via agents: "Elapsed time: 1041.139 msecs"
>> 48  sequential burns: "Elapsed time: 15909.333 msecs"
>> 48  sequential burns: "Elapsed time: 14825.631 msecs"
>> 48  sequential burns: "Elapsed time: 15232.646 msecs"
>> 48  burns via pmap: "Elapsed time: 13586.897 msecs"
>> 48  burns via pmap: "Elapsed time: 3106.56 msecs"
>> 48  burns via pmap: "Elapsed time: 3041.272 msecs"
>> 48  burns via futures: "Elapsed time: 2968.991 msecs"
>> 48  burns via futures: "Elapsed time: 2895.506 msecs"
>> 48  burns via futures: "Elapsed time: 2818.724 msecs"
>> 48  burns via agents: "Elapsed time: 2802.906 msecs"
>> 48  burns via agents: "Elapsed time: 2754.364 msecs"
>> 48  burns via agents: "Elapsed time: 2743.038 msecs"
>>
>> Looking first at the 16-burn runs, we see that concurrency via pmap is
>> actually generally WORSE than sequential. I cannot understand why this
>> should be the case. I guess if I were running on a single core I would
>> expect to see a slight loss when going to pmap because there would be some
>> cost for managing the 16 threads that wouldn't be compensated for by actual
>> concurrency. But I'm running on 16 cores and I should be getting a major
>> speedup, not a slowdown. There are only 16 threads, so there shouldn't be a
>> lot of time lost to overhead.
>>
>> Also interesting, in this case when I start the processes using futures or
>> agents I DO see a speedup. It's on the order of 6x-7x, not close to the 16x
>> that I would hope for, but at least it's a speedup. Why is this so different
>> from the case with pmap? (Recall that my pmap-based method DID produce about
>> the same speedup as my other methods when doing only 4 burns.)
>>
>> For the calls with 48 burns we again see nearly the expected, reasonably
>> good pattern with all concurrent calls performing nearly equivalently (I
>> suppose that the steady improvement over all of the calls is again some kind
>> of JVM optimization), with a speedup in the concurrent calls over the
>> sequential calls in the neighborhood of 5x-6x. Again, not the ~16x that I
>> might hope for, but at least it's in the right direction. The very first of
>> the pmap calls with 48 burns is an anomaly, with only a slight improvement
>> over the sequential calls, so I suppose that's another small mystery.
>>
>> The big mystery so far, however, is in the case of the 16 burns via pmap,
>> which is bizarrely slow on this 16-core machine.
>>
>> Next I tried the same thing on a 48 core machine (
>> http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-2.local).
>> Here I got:
>>
>> 4  sequential burns: "Elapsed time: 3062.871 msecs"
>> 4  sequential burns: "Elapsed time: 2249.048 msecs"
>> 4  sequential burns: "Elapsed time: 2417.677 msecs"
>> 4  burns via pmap: "Elapsed time: 705.968 msecs"
>> 4  burns via pmap: "Elapsed time: 679.865 msecs"
>> 4  burns via pmap: "Elapsed time: 685.017 msecs"
>> 4  burns via futures: "Elapsed time: 687.097 msecs"
>> 4  burns via futures: "Elapsed time: 636.543 msecs"
>> 4  burns via futures: "Elapsed time: 660.116 msecs"
>> 4  burns via agents: "Elapsed time: 708.163 msecs"
>> 4  burns via agents: "Elapsed time: 709.433 msecs"
>> 4  burns via agents: "Elapsed time: 713.536 msecs"
>> 16  sequential burns: "Elapsed time: 8065.446 msecs"
>> 16  sequential burns: "Elapsed time: 8069.239 msecs"
>> 16  sequential burns: "Elapsed time: 8102.791 msecs"
>> 16  burns via pmap: "Elapsed time: 11288.757 msecs"
>> 16  burns via pmap: "Elapsed time: 12182.506 msecs"
>> 16  burns via pmap: "Elapsed time: 14609.397 msecs"
>> 16  burns via futures: "Elapsed time: 2519.603 msecs"
>> 16  burns via futures: "Elapsed time: 2436.699 msecs"
>> 16  burns via futures: "Elapsed time: 2776.869 msecs"
>> 16  burns via agents: "Elapsed time: 2178.028 msecs"
>> 16  burns via agents: "Elapsed time: 2871.38 msecs"
>> 16  burns via agents: "Elapsed time: 2244.687 msecs"
>> 48  sequential burns: "Elapsed time: 24118.218 msecs"
>> 48  sequential burns: "Elapsed time: 24096.667 msecs"
>> 48  sequential burns: "Elapsed time: 24057.327 msecs"
>> 48  burns via pmap: "Elapsed time: 10369.224 msecs"
>> 48  burns via pmap: "Elapsed time: 6837.071 msecs"
>> 48  burns via pmap: "Elapsed time: 4163.926 msecs"
>> 48  burns via futures: "Elapsed time: 3980.298 msecs"
>> 48  burns via futures: "Elapsed time: 4066.35 msecs"
>> 48  burns via futures: "Elapsed time: 4068.199 msecs"
>> 48  burns via agents: "Elapsed time: 4012.069 msecs"
>> 48  burns via agents: "Elapsed time: 4052.759 msecs"
>> 48  burns via agents: "Elapsed time: 4085.018 msecs"
>>
>> Essentially this is the same picture that I got on the 16-core machine:
>> decent (but less than I would like -- only something like 3x-4x) speedups
>> with most concurrent methods in most cases but a bizarre anomaly with 16
>> burns started with pmap, which is again considerably slower than the
>> sequential runs. Why should this be? When I run only 4 burns or a full 48
>> burns the pmap method performs decently (that is, at least things get faster
>> than the sequential calls), but with 16 burns something very odd happens.
>>
>> Finally, I ran the same thing on my MacBook Pro 3.06 GHz Intel Core 2 Duo,
>> Mac OS X 10.6.4, with Clojure 1.2.0-master-SNAPSHOT under
>> Eclipse/Counterclockwise, with a bunch of applications running, so probably
>> this is acting more or less like a single core machine, and got:
>>
>> 4  sequential burns: "Elapsed time: 3487.224 msecs"
>> 4  sequential burns: "Elapsed time: 2327.569 msecs"
>> 4  sequential burns: "Elapsed time: 2137.697 msecs"
>> 4  burns via pmap: "Elapsed time: 12478.725 msecs"
>> 4  burns via pmap: "Elapsed time: 12815.75 msecs"
>> 4  burns via pmap: "Elapsed time: 8464.909 msecs"
>> 4  burns via futures: "Elapsed time: 11494.17 msecs"
>> 4  burns via futures: "Elapsed time: 12365.537 msecs"
>> 4  burns via futures: "Elapsed time: 12098.571 msecs"
>> 4  burns via agents: "Elapsed time: 10361.749 msecs"
>> 4  burns via agents: "Elapsed time: 12458.174 msecs"
>> 4  burns via agents: "Elapsed time: 9016.093 msecs"
>> 16  sequential burns: "Elapsed time: 8706.674 msecs"
>> 16  sequential burns: "Elapsed time: 8748.006 msecs"
>> 16  sequential burns: "Elapsed time: 8729.54 msecs"
>> 16  burns via pmap: "Elapsed time: 46022.281 msecs"
>> 16  burns via pmap: "Elapsed time: 44845.725 msecs"
>> 16  burns via pmap: "Elapsed time: 45393.156 msecs"
>> 16  burns via futures: "Elapsed time: 52822.863 msecs"
>> 16  burns via futures: "Elapsed time: 50647.708 msecs"
>> 16  burns via futures: "Elapsed time: 50337.916 msecs"
>> 16  burns via agents: "Elapsed time: 48615.905 msecs"
>> 16  burns via agents: "Elapsed time: 56703.723 msecs"
>> 16  burns via agents: "Elapsed time: 69765.913 msecs"
>> 48  sequential burns: "Elapsed time: 38885.616 msecs"
>> 48  sequential burns: "Elapsed time: 38651.573 msecs"
>> 48  sequential burns: "Elapsed time: 36669.02 msecs"
>> 48  burns via pmap: "Elapsed time: 169108.022 msecs"
>> 48  burns via pmap: "Elapsed time: 176656.455 msecs"
>> 48  burns via pmap: "Elapsed time: 182119.986 msecs"
>> 48  burns via futures: "Elapsed time: 176764.722 msecs"
>> 48  burns via futures: "Elapsed time: 169257.577 msecs"
>> 48  burns via futures: "Elapsed time: 157205.693 msecs"
>> 48  burns via agents: "Elapsed time: 140618.333 msecs"
>> 48  burns via agents: "Elapsed time: 137992.773 msecs"
>> 48  burns via agents: "Elapsed time: 143153.696 msecs"
>>
>> Here we have a very depressing picture. Although I wouldn't expect to get
>> any speedup from concurrency here, the concurrency-related slowdowns have
>> now spread to all of my concurrency-starting methods with all numbers of
>> burns. It is way way way worse to be using the concurrency methods than the
>> straightforward sequential method in every circumstance. Again, I understand
>> why one should expect a small loss in a case like this, but these are huge
>> losses, and the number of threads that have to be coordinated (with no
>> shared state) is quite small -- just 4-48.
>>
>> My guess is that all of this is stemming from some confusion on my part
>> about how I should be starting and managing concurrent processes, and my
>> greatest hope is that one of you will show me an alternative to my
>> burn-via-* functions that provides a speedup nearly linear with the number
>> of cores and only a negligible loss when there's only one core available...
>>
>> But any help of any kind would be appreciated.
>>
>> Thanks,
>>
>>  -Lee
>>
>> --
>> Lee Spector, Professor of Computer Science
>> School of Cognitive Science, Hampshire College
>> 893 West Street, Amherst, MA 01002-3359
>> lspec...@hampshire.edu, http://hampshire.edu/lspector/
>> Phone: 413-559-5352, Fax: 413-559-5438
>>
>> Check out Genetic Programming and Evolvable Machines:
>> http://www.springer.com/10710 - http://gpemjournal.blogspot.com/
>>
