multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
Keeping the discussion here would make sense, esp. in light of meetup.com's 
horrible discussion board.

I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I 
wrote in that meetup thread, but Lee's challenge(s) were too hard to resist:

 Would your conclusion be something like 'Intensive list processing can't 
 currently be done in Java (or Clojure/JVM) in a way that takes reasonable 
 advantage of multiple cores.'?


 I see that a rewrite without lists might be the only way out, but that'd be 
 a bummer. Clojure is a multicore Lisp; if you can't use it for multicore list 
 processing then... sigh.

The nature of the `burn` program is such that I'm skeptical of the ability of 
any garbage-collected runtime (lispy or not) to scale its operation across 
multiple threads.  It generates a huge amount of garbage that is held for a 
very long time (in GC terms), producing a great deal of contention on the heap 
between the active job threads and the GC implementation, which must constantly 
clean up after them, compact the results, and so on.  The workload is not 
CPU-bound, so I don't see it being parallelized effectively.

I'd be very interested in seeing an implementation for a runtime that proves me 
wrong, i.e. one that can attain significant performance benefits by running, 
say, 8 `burn` jobs across multiple threads.
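The actual Java port is in the gist linked below; as a rough sketch of the allocation pattern being described (class name, list sizes, and iteration counts here are hypothetical, not the benchmark itself), assume each job repeatedly rebuilds a large list so that nearly every allocation becomes long-lived garbage:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical approximation of the `burn` allocation pattern (not the
// actual benchmark from the gist): each job repeatedly rebuilds a large
// list, so almost every allocation is garbage that lives long enough to
// contend with the collector.
public class BurnSketch {
    // Build a list of `size` elements, then rebuild it `iterations` times.
    static List<Long> burn(int size, int iterations) {
        List<Long> xs = new ArrayList<>(size);
        for (int i = 0; i < size; i++) xs.add((long) i);
        for (int iter = 0; iter < iterations; iter++) {
            List<Long> copy = new ArrayList<>(size);  // copy-on-write step
            for (Long x : xs) copy.add(x + 1);
            xs = copy;  // the previous list immediately becomes garbage
        }
        return xs;
    }

    // Run `jobs` identical burn jobs on `threads` threads; return wall-clock ms.
    static long run(int jobs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (int i = 0; i < jobs; i++) {
            futures.add(pool.submit(() -> burn(100_000, 50)));
        }
        for (Future<List<Long>> f : futures) f.get();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        System.out.println(threads + " thread(s): " + run(8, threads) + " ms");
    }
}
```

If the GC-contention hypothesis is right, `run(8, n)` should show wall-clock time improving far less than linearly as `n` grows, even though the jobs share no data.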

In an attempt to do just that (prove myself wrong), I thought I'd golf the 
`burn` program again, this time in Racket (onlookers can look at the Java 
implementation I knocked out a few days ago here: 
https://gist.github.com/4682554), presumably as well-suited to multicore list 
processing as any other language.  The results:

https://gist.github.com/4682453

Please forgive my perhaps pitiful Racket-fu.  I'm sure there's threadpool and 
CAS libraries available, but it was entertaining to hack away with the 
threading and semaphore primitives.  If I've goofed something up badly (it's 
been some time since I've schemed), please speak up.

The Racket and Java implementations are effectively equivalent:

`java -server -Xmx2g cemerick.burn $THREAD_COUNT`  (oracle java 1.7.0_09)
1 thread: 61s
2 threads: 126s

`java -server -Xmx2g cemerick.burn $THREAD_COUNT`  (oracle java 1.7.0_09)
1 thread: 47s
2 threads: 92s

`time ./racket -t ~/burn.scm -m $THREAD_COUNT`  (racket 5.3.1)
1 thread: 45s
2 threads: 126s

The above was run on OS X 10.7.5 with 4GB of physical memory.  If someone knows 
of ways to get better perf on racket, let me know.  I tried running from the 
results of `raco make` and such, but saw no improvements; I guess it is 
strictly producing bytecode that is later JIT'ed, and not doing any kind of 
additional AOT optimizations...

It'd be interesting to see timings from different operating systems, and very 
interesting to see timings from running burn.scm on other Schemes or on a 
Common Lisp implementation.

Cheers,

- Chas

On Jan 30, 2013, at 9:20 PM, Lee Spector wrote:

 
 FYI we had a bit of a discussion about this at a meetup in Amherst MA 
 yesterday, and while I'm not sufficiently on top of the JVM or system issues 
 to have briefed everyone on all of the details, there has been a little 
 followup since the discussion, including results of some different 
 experiments by Chas Emerick, at: 
 http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382
 
 -Lee
 
 On Jan 30, 2013, at 8:39 PM, Marshall Bockrath-Vandegrift wrote:
 
 Apologies for my very-slow reply here.  I keep thinking that I’ll have
 more time to look into this issue, and keep having other things
 requiring my attention.  And on top of that, I’ve temporarily lost the
 many-way AMD system I was using as a test-bed.
 
 I very much want to see if I can get my hands on an Intel system to
 compare to.  My AMD system is in theory 32-way – two physical CPUs, each
 with 16 cores.  However, Linux reports (via /proc/cpuinfo) the cores in
 groups of 8 (“cpu cores : 8” etc).  And something very strange happens
 when extending parallelism beyond 8-way...  I ran several experiments
 using a version of your whole-application benchmark I modified to
 control the level of parallelism.  At parallelism 9+, the real time it
 takes to complete the benchmark hardly budges, but the user/CPU time
 increases linearly with the level of parallelism!  As far as I can tell,
 multi-processor AMD *is* a NUMA architecture, which might potentially
 explain things.  But enabling the JVM NUMA options doesn’t seem to
 affect the benchmark.
 
 I think next steps are two-fold: (1) examine parallelism vs. real & CPU
 time on an Intel system, and (2) attempt to reproduce the observed
 behavior in pure Java.  I’m keeping my fingers crossed that I’ll have
 some time to look at this more soon, but I’m honestly not very hopeful.
 
 In the mean time, I hope you’ve managed to exploit multi-process
 parallelism to run more efficiently?
 
 -Marshall
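For onlookers, here is what "the JVM NUMA options" Marshall mentions usually amount to, plus the OS-level counterpart checks. This is a sketch: exact flag behavior depends on the HotSpot version, and `numactl` must be installed on the Linux box.

```shell
# JVM-side: NUMA-aware young-generation allocation; on HotSpot of this
# vintage it only takes effect together with the parallel collector.
java -server -Xmx2g -XX:+UseParallelGC -XX:+UseNUMA cemerick.burn 8

# OS-side: pin the whole process to one NUMA node to rule out cross-node
# memory traffic, or interleave allocations across nodes for comparison.
numactl --cpunodebind=0 --membind=0 java -server -Xmx2g cemerick.burn 8
numactl --interleave=all java -server -Xmx2g cemerick.burn 8
```

Comparing the pinned run against the unpinned 9+-way runs would be one way to confirm or reject the NUMA explanation.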
 

Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Marshall Bockrath-Vandegrift
Chas Emerick c...@cemerick.com writes:

 Keeping the discussion here would make sense, esp. in light of
 meetup.com's horrible discussion board.

Excellent.  Solves the problem of deciding the etiquette of jumping on
the meetup board for a meetup one has never been involved in.  :-)

 The nature of the `burn` program is such that I'm skeptical of the
 ability of any garbage-collected runtime (lispy or not) to scale its
 operation across multiple threads.

Bringing you up to speed on this very long thread, I don’t believe the
original `burn` benchmark is a GC issue, due to results cameron first
reported here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ

I then narrowed that down to what I still believe to be the explanation here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/jd8dpmzEtEYJ

And I have more compact demonstration benchmark results here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/tCCkjXxTUMEJ

I haven’t been able to produce those results in a minimal Java-only test
case, though.

Then Wm. Josiah posted a full-application benchmark, which appears to
have entirely different performance problems from the synthetic `burn`
benchmark.  I’d rejected GC as the cause for the slowdown there too, but
ATM can’t recall why or what I tested, so GC may definitely be a
candidate to re-examine:

https://groups.google.com/d/msg/clojure/48W2eff3caU/K224Aqwkn5YJ

Quite looking forward to additional insight...

-Marshall

-- 
-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.




Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
On Jan 31, 2013, at 9:23 AM, Marshall Bockrath-Vandegrift wrote:

 Chas Emerick c...@cemerick.com writes:
 
 The nature of the `burn` program is such that I'm skeptical of the
 ability of any garbage-collected runtime (lispy or not) to scale its
 operation across multiple threads.
 
 Bringing you up to speed on this very long thread, I don’t believe the
 original `burn` benchmark is a GC issue, due to results cameron first
 reported here:
 
https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ
 
 I then narrowed that down to what I still believe to be the explanation here:
 
https://groups.google.com/d/msg/clojure/48W2eff3caU/jd8dpmzEtEYJ
 
 And I have more compact demonstration benchmark results here:
 
https://groups.google.com/d/msg/clojure/48W2eff3caU/tCCkjXxTUMEJ
 
 I haven’t been able to produce those results in a minimal Java-only test
 case, though.
 
 Then Wm. Josiah posted a full-application benchmark, which appears to
 have entirely different performance problems from the synthetic `burn`
 benchmark.  I’d rejected GC as the cause for the slowdown there too, but
 ATM can’t recall why or what I tested, so GC may definitely be a
 candidate to re-examine:
 
https://groups.google.com/d/msg/clojure/48W2eff3caU/K224Aqwkn5YJ
 
 Quite looking forward to additional insight...

Yeah, the thread is far too large for me to reasonably digest (and there's 
inevitably a lot of noise in rounds of microbenchmarking golf ;-)

BTW, I just realized I copy/pasted the wrong command for the faster of the Java 
implementation benchmarks in the previous mail; it used -XX:-UseParallelGC:

`java -server -Xmx2g -XX:-UseParallelGC cemerick.burn $THREAD_COUNT`

I've not looked at clojush at all; it is AFAICT a complex application in its 
own right, and I wouldn't know where to start.  My understanding is that the 
`burn` function is considered representative of the bulk of operations in 
clojush, but I'm happy to assume that that's not the case.  There are likely a 
number of things that go into the results of the `burn` benchmark, never mind 
the apparent performance of clojush.

Here's what I know:

* Two separate implementations (Java and Racket) exhibit very similar multicore 
performance characteristics.  In particular, the Java implementation would not 
suffer from any particular megamorphic inefficiencies, since no such calls are 
made in that program.  Racket is far more of a black box to me than the JVM, 
but it does not provide much of any polymorphism at all, so JIT-related issues 
are presumably not in scope there.
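Since "megamorphic" is doing a lot of work in that sentence, a minimal illustration (names and workload are mine, not from the thread): HotSpot will inline and specialize a call site that has only ever seen one or two receiver types, but once a site has seen three or more it becomes megamorphic and falls back to slower virtual dispatch.

```java
// Sketch of mono- vs. megamorphic call sites. Each lambda below compiles
// to its own class, so the two loops differ in how many receiver types
// their `apply` call site observes.
public class Dispatch {
    interface Op { long apply(long x); }

    static final Op INC = x -> x + 1;
    static final Op DBL = x -> x * 2;
    static final Op NEG = x -> -x;

    // Monomorphic: this `apply` site only ever sees one receiver type,
    // so the JIT can inline it.
    static long monomorphic(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) acc = INC.apply(acc);
        return acc;
    }

    // Megamorphic: the same site cycles through three receiver types,
    // preventing the JIT from specializing it to a single target.
    static long megamorphic(long n) {
        Op[] ops = { INC, DBL, NEG };
        long acc = 0;
        for (long i = 0; i < n; i++) acc = ops[(int) (i % 3)].apply(acc);
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(monomorphic(1_000_000));
        System.out.println(megamorphic(1_000_000));
    }
}
```

The point above is that the Java `burn` port contains no such polymorphic call sites, so this class of JIT inefficiency can't explain its scaling behavior.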

* Broadly speaking, heap allocation and its evil twin, GC, sabotage 
parallelism and performance in general.  Functional programming with immutable 
values has been made possible in large part by the gains registered by 
persistent data structures, structural sharing, lazy evaluation, and so on; 
without them, we're back to copy-on-write, which is roughly what `burn` 
represents in grand style.
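That contrast can be sketched in a few lines of Java (helper names are mine; the tiny persistent cons list stands in for what Clojure's data structures do far more thoroughly):

```java
import java.util.ArrayList;
import java.util.List;

// Structural sharing vs. copy-on-write. Prepending to a persistent cons
// list allocates one node and shares the entire tail; the copy-on-write
// version duplicates every element and turns the old list into garbage --
// roughly what `burn` does in grand style.
public class Sharing {
    static final class Cons {
        final long head;
        final Cons tail;  // shared with the previous version, never copied
        Cons(long head, Cons tail) { this.head = head; this.tail = tail; }
    }

    // Structural sharing: O(1) allocation per prepend, no garbage created.
    static Cons prepend(long x, Cons xs) { return new Cons(x, xs); }

    // Copy-on-write: O(n) allocation per prepend; the old list is garbage.
    static List<Long> prependCopy(long x, List<Long> xs) {
        List<Long> copy = new ArrayList<>(xs.size() + 1);
        copy.add(x);
        copy.addAll(xs);
        return copy;
    }

    static int length(Cons xs) {
        int n = 0;
        for (Cons c = xs; c != null; c = c.tail) n++;
        return n;
    }

    public static void main(String[] args) {
        Cons shared = prepend(3, prepend(2, prepend(1, null)));
        System.out.println("persistent list length: " + length(shared));
    }
}
```

A workload dominated by `prependCopy`-style full copies generates allocation (and GC work) proportional to list length on every operation, which is exactly the pressure `burn` puts on the heap.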

Now, all this digital ink spilled, and just for kicks, I go back to run the 
Clojure `burn` again (https://gist.github.com/4683413 in a 1.5.0-RC2 REPL 
started with jvm args of `-Xmx2g -XX:-UseParallelGC`):

1 thread: 38s
2 threads: 32s
4 threads: 23s

Hardly a 2x or 4x improvement, but I had previously obtained badly degrading 
timings corresponding to those in my previous mail.  Microbenchmarking sucks.   
At least I got to twiddle with Racket again.  :-P

Cheers,

- Chas





Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Lee Spector

On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote:
 
 Then Wm. Josiah posted a full-application benchmark, which appears to
 have entirely different performance problems from the synthetic `burn`
 benchmark.  I’d rejected GC as the cause for the slowdown there too, but
 ATM can’t recall why or what I tested, so GC may definitely be a
 candidate to re-examine:
 
 
 I've not looked at clojush at all; it is AFAICT a complex application in its 
 own right, and I wouldn't know where to start.  My understanding is that the 
 `burn` function is considered representative of the bulk of operations in 
 clojush, but I'm happy to assume that that's not the case.  There's likely a 
 number of things that go into the results of the `burn` benchmark, nevermind 
 the apparent performance of clojush.

FWIW I wrote the burn benchmark because lots of intensive list manipulation 
is at the heart of our real applications, and that seemed to be a nicely 
minimal way to see how that could scale across cores.

Our real applications may indeed raise other issues too, but getting lots of 
intensive list manipulation to scale well is bound to be good.

FWIW the full-application benchmark that Josiah posted was made deterministic 
in a way that means that big chunks of our code won't actually be executing. 
The system generates and interprets lots of Push programs, and if I recall 
correctly the benchmark hardcodes all of the Push programs to be the same 
constant program, meaning that not all of the Push instructions will run, etc. 
It'd be trickier to get the real mix of instructions in a way that makes for 
an easily reproducible benchmark, and this benchmark is at least testing the 
basic interpreter and population-level infrastructure. If we can get this to 
scale reasonably, then we can see how much that helps under more completely 
real conditions.

 -Lee
