Even though this is very surprising (and sad) to hear, I'm afraid I've
got a different experience... My reducer-based parallel minimax is about
3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit
faster still on my girlfriend's Intel i5 (2 physical cores + 2 virtual);
I suspect that's due to its slightly larger L3 cache... So no complaints
on my part.
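Roughly the shape of what I mean by "reducer-based" is below. This is
just a sketch, not my actual minimax; score stands in for a hypothetical
move-evaluation fn:

(require '[clojure.core.reducers :as r])

(defn best-move [score moves]
  ;; r/fold splits the vector of moves across fork/join workers;
  ;; partition size 1 because each evaluation is expensive.
  (r/fold 1
          (fn ([] nil)
              ([a b] (cond (nil? a) b
                           (nil? b) a
                           :else (max-key score a b))))
          (fn [best m]
            (if (or (nil? best) (> (score m) (score best))) m best))
          (vec moves)))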
Now, with regard to pmap... The only occasions where I find pmap useful
are for coarse-grained concurrency... In other words, I only use pmap
where I'd otherwise use Executors: usually when mapping a very expensive
fn (one that needs no coordination) across a large collection of roughly
equally-sized elements. In those cases I get essentially linear speedup,
and I don't see why I wouldn't. If you're familiar with text-mining (I
think you are), consider annotating a collection of documents (500 -
1000) based on some dictionaries and probabilistic taggers. That can
take more than 15 seconds per document, so pmap pays off just as you'd
expect.
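Concretely, the shape I have in mind is something like this
(annotate-document is a hypothetical stand-in for a pure, multi-second
fn, not a real library call):

(defn annotate-document [doc]
  ;; ... expensive dictionary lookups and probabilistic tagging ...
  doc)

(defn annotate-corpus [docs]
  ;; Each task costs seconds, so pmap's dispatch overhead is negligible
  ;; and the cores stay busy as long as the tasks cost about the same.
  (doall (pmap annotate-document docs)))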
I've not done any experiments with reducers on clusters, but I wouldn't
expect them to perform well when distributed across several nodes:
you'd have horrible locality for anything that requires coordination
(like minimax). However, if you could run something like a genetic
algorithm across several nodes, with minimax running on each node, then
I'd expect good data locality and thus tremendous speedup. This is
exactly what I'm planning to do for training my chess neural-net. Using
20-30 8-core machines would be ideal for this use-case (20-30
individuals in the population, each lifetime (5 games) running on a
separate machine that is entirely devoted to minimax)...
Anyway, like you, I am baffled by your experience... You say that
monitoring the CPUs shows they are all busy throughout the entire task.
If that is truly the case then I really don't understand! Otherwise, as
you say, "pmap may leave cores idle for a bit if some tasks take a lot
longer than others"... This is why I said "equally-sized" elements
before... Make sure you monitor your CPUs closely to verify that they
are indeed busy; this has bitten me in the past (a small demonstration
follows). Of course, in your case (range 8) does contain equally-sized
elements, so I don't know what to think! In any case there might be
something going on that we're not seeing...
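If you want to see the uneven-task effect in isolation, timing something
like this shows it (just a sketch, with Thread/sleep standing in for
real work):

(defn fake-task [ms]
  (Thread/sleep ms)
  ms)

;; Roughly equal tasks: wall time close to total work / number of cores.
(time (doall (pmap fake-task (repeat 8 1000))))

;; One straggler: wall time is dominated by the slowest task, and the
;; other cores sit idle once their own tasks finish.
(time (doall (pmap fake-task [8000 1000 1000 1000 1000 1000 1000 1000])))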
Jim
On 08/12/12 01:25, Lee Spector wrote:
I've been running compute-intensive (multi-day), highly parallelizable Clojure processes
on high-core-count machines, blithely assuming that since I saw near-maximal CPU
utilization in "top" and the like, I was probably getting good speedups.
But a colleague recently did some tests and the results are really quite
alarming.
On Intel machines we're seeing speedups, but much less than I expected: only
about 2x going from 1 to 8 cores.
But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost
twice as long on 8 cores as on 1.
I'm baffled, and unhappy that my runs are probably going slower on 48-core and
64-core nodes than on single-core nodes.
It's possible that I'm just doing something wrong in the way that I dispatch
the tasks, or that I've missed some Clojure or JVM setting... but right now I'm
mystified and would really appreciate some help.
I'm aware that there's overhead for multicore distribution and that one can
expect slowdowns if the computations that are being distributed are fast
relative to the dispatch overhead, but this should not be the case here. We're
distributing computations that take seconds or minutes, and not huge numbers of
them (at least in our tests while trying to figure out what's going on).
I'm also aware that the test that produced the data below, insofar as it
uses pmap for the distribution, may leave cores idle for a bit if some tasks
take a lot longer than others, because of the way that pmap allocates work to
threads. But that shouldn't be a big issue here either, because for this test
all of the threads are doing the exact same computation. And I also tried an
agent-based dispatch approach that shouldn't have the pmap thread-allocation
issue (sketched after the code below), and the results were about the same.
Note also that all of the computations in this test are purely functional and
independent -- there shouldn't be any resource contention issues.
The test: I wrote a time-consuming function that just does a bunch of math and
list manipulation (which is what takes a lot of time in my real applications):
(defn burn
  ;; Builds a 10000-element list of arithmetic results, then reverses it
  ;; 10000 times; purely functional, so runs should be fully independent.
  ([] (loop [i 0
             value '()]
        (if (>= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons (* (int i)
                          (+ (float i)
                             (- (int i)
                                (/ (float i)
                                   (inc (int i))))))
                       value)))))
  ([_] (burn)))  ; one-arg arity so it can be mapped over a collection
Then I have a main function like this:
(defn -main
  [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))
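The agent-based variant I mentioned above was along these lines (a
sketch, not our exact code):

(defn agent-burn [n]
  ;; One agent per task; send-off runs each burn call on its own pooled
  ;; thread, and await blocks until all of them have finished.
  (let [agents (mapv agent (range n))]
    (doseq [a agents]
      (send-off a burn))  ; calls (burn old-value); burn's 1-arg arity ignores it
    (apply await agents)
    (mapv deref agents)))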
We run it with "lein run" (we've tried both Leiningen 1.7.1 and
2.0.0-preview10) on Java 1.7.0_03 (Java HotSpot(TM) 64-Bit Server VM); we also
tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with
-Xmx and -Xms settings), and also with and without -XX:+UseParallelGC. None of
this seems to change the picture substantially.
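For concreteness, the kind of project.clj settings we've been varying
look like this (the values are illustrative, not recommendations):

:jvm-opts ["-Xms2g" "-Xmx4g" "-XX:+UseParallelGC"]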
The results that we get generally look like this:
- On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code
above, it takes about 45 seconds (and all cores appear to be fully loaded as it
does so). If we change the pmap to just plain map, so that we use only a single
core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8
cores is just about 2x, even though there are 8 completely independent tasks.
So that's pretty depressing.
- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running
the same test (with pmap) takes about 4 minutes and 2 seconds. That's really
slow! Changing the pmap to map here produces a runtime of about 2 minutes and
20 seconds. So it's quite a bit faster on one core than on eight! And all of
these times are terrible compared to those on the Intel.
Another strange observation: we can run multiple instances of the test on the same machine
and (up to some limit, presumably) they don't seem to slow each other down, even though just one
instance of the test appears to be maxing out all of the CPUs according to "top". I
suppose that means "top" isn't telling me what I thought; my colleague says it can
mean that something is blocked in some way with a full instruction queue.

But I'm not interested in running multiple instances. I have single computations that involve
multiple expensive but independent subcomputations, and I want to farm those subcomputations out
to multiple cores and get speedups as a result. My subcomputations are so completely independent
that I think I should be able to get speedups approaching a factor of n for n cores, but what I
see is a factor of only about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD
machines.
Any help would be greatly appreciated!
Thanks,
-Lee
--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438