Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Dave Tenny
As a person who has recently been dabbling with Clojure for evaluation 
purposes, I wondered if anybody wanted to post some links to parallel 
Clojure apps that have been clear and easy parallelism wins for the types 
of applications that Clojure was designed for.  (As a contrast to the lengthy 
discussion and analysis of this topic, which is *hopefully* the exception and 
not the rule.)

Any good links here?  Any high profile stuff?





Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread László Török
Hi,

I believe Clojure's original mission has been to give you tools for handling
concurrency[1] in your programs in a sane way.
However, with the advent of Reducers[2], the landscape is changing quite a
bit.
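
For a concrete sense of what Reducers add, here is a minimal sketch of mine
(assuming only clojure.core.reducers, nothing from this thread): plain reduce
is inherently sequential, while r/fold can partition a vector and combine the
partial results in parallel on the fork/join pool.

(require '[clojure.core.reducers :as r])

;; reduce walks the collection item by item on one thread; r/fold splits
;; the vector into chunks, reduces the chunks in parallel, and combines
;; the partial sums with + (which also supplies the identity, (+) => 0).
(def v (vec (range 1000000)))
(reduce + v)  ; sequential
(r/fold + v)  ; potentially parallel, same result
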
If you're interested in the concurrency vs. parallelism terminology and
what language constructs are available to support each, [3] should give you
a very good overview.
I realize I haven't actually answered your question, but I wasn't sure
whether you were referring to concurrency or parallelism. :)

Las

[1] http://clojure.org/concurrent_programming
[2]
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
[3]
http://clojure-doc.org/articles/language/concurrency_and_parallelism.html





-- 
László Török



Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Michael Klishin
2013/11/6 Dave Tenny dave.te...@gmail.com

 (To contrast the lengthy discussion and analysis of this topic that is
 *hopefully* the exception and not the rule)


Some of the comments reveal that part of the problem lies with the JVM
memory allocator, which has its throughput limits.

There are known large commercial systems written entirely in Clojure that
demonstrate excellent latency (and, likely, throughput). I believe Factual,
Runa/Staples Labs and Prismatic posted various numbers here, but I don't
have the links handy.

Feel free to start a separate thread to collect some feedback.
-- 
MK

http://github.com/michaelklishin
http://twitter.com/michaelklishin

-- 
-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: abysmal multicore performance, especially on AMD processors

2013-11-06 Thread Timothy Baldridge
You should also specify how many cores you plan on devoting to your
application. Notice that most of this discussion has been about JVM apps
running on machines with 32 cores. Systems like this aren't exactly common
in my line of work (where we tend to run greater numbers of smaller servers
using virtualization), but perhaps they are in yours.

Timothy Baldridge






-- 
“One of the main causes of the fall of the Roman Empire was that–lacking
zero–they had no way to indicate successful termination of their C
programs.”
(Robert Firth)



Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Wm. Josiah Erikson
Interesting! If that is true of Java (I don't know Java at all), then your
argument seems plausible. Cache-to-main-memory writes still take many more
CPU cycles than processor-to-cache writes (an order of magnitude more, last
I knew). I don't think it's so much a bandwidth issue as a latency one,
AFAIK. Thanks for thinking about this more, so long after the fact. We
still see the issue.

Re: abysmal multicore performance, especially on AMD processors

2013-09-27 Thread Neale Swinnerton
The Disruptor project from LMAX has wrestled with these sorts of issues at
length and achieved astounding levels of performance on the JVM.

Martin Thompson, the original author of the Disruptor, is a leading light
in the JVM performance space; his Mechanical Sympathy blog is a goldmine of
information and a must-read for anyone wanting to understand the JVM /
hardware interface.

Find more info at:

http://mechanical-sympathy.blogspot.co.uk/

http://lmax-exchange.github.io/disruptor/


*Neale Swinnerton*
{t: @sw1nn https://twitter.com/#!/sw1nn, w: sw1nn.com }



Re: abysmal multicore performance, especially on AMD processors

2013-09-26 Thread Andy Fingerhut
Adding to this thread from almost a year ago.  I don't have conclusive
proof with experiments to show right now, but I do have some experiments
that have led me to what I think is a plausible cause of not just Clojure
programs running more slowly when multi-threaded than when single-threaded,
but any programs running on the JVM's memory model doing so.  Qualification: My
explanation would be true only for multi-threaded programs running on the
JVM that store significant amounts of data in memory, even if the data
written by each thread is only read by that thread, and there is no locking
or other inter-thread communication, i.e. for embarrassingly parallel
problems.

Outline of the argument:

When you start a thread, and then wait for the thread to complete, the JVM
memory model requires all loads and stores to satisfy certain
restrictions.  One of these is that any store done before the thread is
created should 'happen before' the thread start, and thus the updated
stored values must be visible to the new thread.  'Visible' here means that
the thread doing the store must cause the CPU it is running on to update
main memory from whatever locally modified values it has written into its
local cache.  That rule isn't so relevant to my argument.

The one that is relevant is that any store performed by the thread is
considered to 'happen before' a join operation on the thread.  Thus any
store done by a thread must be written back to main memory, *even if the
store is to a JVM object that later becomes garbage*.

So imagine a single-threaded program that creates X bytes of garbage while
it runs.  Those X bytes will definitely be written to the CPU's local
cache, but they will only be written to main memory if the cache space runs
out before the garbage collector does its work and allows that memory to be
reused for allocations.  The CPU-to-local-cache bandwidth in many modern
systems is significantly faster than local-cache-to-main-memory bandwidth.

Now take that same program and spread its work across 2 or more threads,
with a join at the end of each one.  For the sake of example, say that each
thread will write X/N bytes of data while it runs.  Even if the only data
needed later in the rest of the program is a single Long object, for
example, all of those X/N bytes of data will be copied from the local cache
to main memory (if that did not already happen before the thread
terminated).

If the number of threads is large enough, the amount of data written from
all local caches to main memory can be higher in the multi-threaded case
than in the single-threaded case.

Anyway, that is my hypothesis about what could be happening here.  It isn't
Clojure-specific, but it can be exacerbated by the common behavior of a lot
of Clojure code to allocate significant amounts of memory that becomes
garbage.
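
A rough way to probe this hypothesis is to allocate the same total amount of
short-lived garbage on 1 or N threads and compare wall-clock times.  A minimal
sketch of mine (not code from this thread; the sizes are arbitrary):

;; Allocate `bytes` of short-lived garbage in 1 KB chunks, keeping only a
;; running count so the arrays themselves are immediately collectible.
(defn burn-garbage [bytes]
  (loop [remaining bytes, acc 0]
    (if (pos? remaining)
      (recur (- remaining 1024) (+ acc (count (byte-array 1024))))
      acc)))

;; Split the allocation across n-threads threads, join them all, and report
;; elapsed milliseconds.  Under the hypothesis above, each join forces the
;; thread's stores -- garbage included -- out toward main memory.
(defn timed-run [n-threads total-bytes]
  (let [per-thread (quot total-bytes n-threads)
        start (System/nanoTime)
        threads (doall (for [_ (range n-threads)]
                         (doto (Thread. #(burn-garbage per-thread))
                           (.start))))]
    (doseq [^Thread t threads] (.join t))
    (/ (- (System/nanoTime) start) 1e6)))

;; e.g. compare (timed-run 1 (* 512 1024 1024)) with (timed-run 8 (* 512 1024 1024))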

Andy




multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick
Keeping the discussion here would make sense, esp. in light of meetup.com's 
horrible discussion board.

I don't have a lot to offer on the JVM/Clojure-specific problem beyond what I 
wrote in that meetup thread, but Lee's challenge(s) were too hard to resist:

 Would your conclusion be something like 'Intensive list processing can't 
 currently be done in Java (or Clojure/JVM) in a way that takes reasonable 
 advantage of multiple cores.'?


 I see that a rewrite without lists might be the only way out, but that'd be 
 a bummer. Clojure is a multicore Lisp; if you can't use it for multicore list 
 processing then... sigh.

The nature of the `burn` program is such that I'm skeptical of the ability of 
any garbage-collected runtime (lispy or not) to scale its operation across 
multiple threads.  It generates a huge amount of garbage that is held for a 
very long time (in GC terms), producing a great deal of contention on the 
heap between the active job threads and the GC implementation, which constantly 
has to clean up after them, compact the results, etc.  The workload is not 
CPU-bound, so I don't see it being parallelized effectively.

I'd be very interested in seeing an implementation for a runtime that proves me 
wrong, i.e. can attain significant performance benefits by running the e.g. 8 
`burn` jobs across multiple threads.
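
For anyone who wants to try, here is a minimal harness sketch of mine
(assumptions: Clojure, and a `job-fn` standing in for one `burn` job; none of
this is code from the thread):

(import '(java.util.concurrent Executors Future))

;; Run n-jobs copies of job-fn on a fixed pool of n-threads threads and
;; print the elapsed wall-clock seconds, mirroring the $THREAD_COUNT runs
;; shown below.
(defn time-jobs [n-threads n-jobs job-fn]
  (let [pool  (Executors/newFixedThreadPool n-threads)
        start (System/nanoTime)
        futs  (mapv #(.submit pool ^Callable %) (repeat n-jobs job-fn))]
    (doseq [^Future f futs] (.get f))
    (.shutdown pool)
    (printf "%d thread(s): %.1fs%n" n-threads
            (/ (- (System/nanoTime) start) 1e9))))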

In an attempt to do just that (prove myself wrong), I thought I'd golf the 
`burn` program (again; onlookers can go look at the Java implementation I 
knocked out a few days ago here: https://gist.github.com/4682554) on Racket, 
presumably as well-suited to multicore list processing as any other.  The 
results:

https://gist.github.com/4682453

Please forgive my perhaps pitiful Racket-fu.  I'm sure there are threadpool and 
CAS libraries available, but it was entertaining to hack away with the 
threading and semaphore primitives.  If I've goofed something up badly (it's 
been some time since I've schemed), please speak up.

The Racket and Java implementations are effectively equivalent:

`java -server -Xmx2g cemerick.burn $THREAD_COUNT`  (oracle java 1.7.0_09)
1 thread: 61s
2 threads: 126s

`java -server -Xmx2g cemerick.burn $THREAD_COUNT`  (oracle java 1.7.0_09)
1 thread: 47s
2 threads: 92s

`time ./racket -t ~/burn.scm -m $THREAD_COUNT`  (racket 5.3.1)
1 thread: 45s
2 threads: 126s

The above was run on OS X 10.7.5 with 4GB of physical memory.  If someone knows 
of ways to get better perf on Racket, let me know.  I tried running from the 
results of `raco make` and such, but saw no improvements; I guess it is 
strictly producing bytecode that is later JIT'ed, and not doing any kind of 
additional AOT optimization...

It'd be interesting to see timings from different operating systems, and very 
interesting to see timings from running burn.scm on other Schemes or on a 
Common Lisp implementation.

Cheers,

- Chas


Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Marshall Bockrath-Vandegrift
Chas Emerick c...@cemerick.com writes:

 Keeping the discussion here would make sense, esp. in light of
 meetup.com's horrible discussion board.

Excellent.  Solves the problem of deciding the etiquette of jumping on
the meetup board for a meetup one has never been involved in.  :-)

 The nature of the `burn` program is such that I'm skeptical of the
 ability of any garbage-collected runtime (lispy or not) to scale its
 operation across multiple threads.

Bringing you up to speed on this very long thread, I don’t believe the
original `burn` benchmark is a GC issue, due to results cameron first
reported here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/83uXZjLi3iAJ

I then narrowed that down to what I still believe to be the explanation here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/jd8dpmzEtEYJ

And have more compact demonstration benchmark results here:

https://groups.google.com/d/msg/clojure/48W2eff3caU/tCCkjXxTUMEJ

I haven’t been able to produce those results in a minimal Java-only test
case, though.

Then Wm. Josiah posted a full-application benchmark, which appears to
have entirely different performance problems from the synthetic `burn`
benchmark.  I’d rejected GC as the cause for the slowdown there too, but
ATM can’t recall why or what I tested, so GC may definitely be a
candidate to re-examine:

https://groups.google.com/d/msg/clojure/48W2eff3caU/K224Aqwkn5YJ

Quite looking forward to additional insight...

-Marshall





Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Chas Emerick

Yeah, the thread is far too large for me to reasonably digest (and there's 
inevitably a lot of noise in rounds of microbenchmarking golf ;-)

BTW, I just realized I copy/pasted the wrong command for the faster of the Java 
implementation benchmarks in the previous mail; it used -XX:-UseParallelGC:

`java -server -Xmx2g -XX:-UseParallelGC cemerick.burn $THREAD_COUNT`

I've not looked at clojush at all; it is AFAICT a complex application in its 
own right, and I wouldn't know where to start.  My understanding is that the 
`burn` function is considered representative of the bulk of operations in 
clojush, but I'm happy to assume that that's not the case.  There's likely a 
number of things that go into the results of the `burn` benchmark, nevermind 
the apparent performance of clojush.

Here's what I know:

* Two separate implementations (Java and Racket) exhibit very similar multicore 
performance characteristics.  In particular, the Java implementation would not 
suffer from any particular megamorphic inefficiencies, since no such calls are 
made in that program.  Racket is far more of a black box to me than the JVM, 
but it does not provide much of any polymorphism at all, so JIT-related issues 
are presumably not in scope there.

* Broadly speaking, heap allocation and its evil twin, GC, sabotage 
parallelism and performance in general.  Functional programming with immutable 
values has been made possible in large part by the gains registered by 
persistent data structures, structural sharing, lazy evaluation, and so on; 
without them, we're back to copy-on-write, which is roughly what `burn` 
represents in grand style.
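
To illustrate the contrast with a toy example of mine (not from the thread):
consing onto a persistent list shares the entire existing tail, while
rebuilding a vector from a seq copies every element -- the copy-on-write shape
that floods the allocator:

(def xs (doall (range 100000)))
;; shares the tail of xs: constant allocation per iteration
(time (dotimes [_ 1000] (cons 0 xs)))
;; copies all 100001 elements on every iteration
(time (dotimes [_ 1000] (vec (cons 0 xs))))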

Now, all this digital ink spilled, and just for kicks, I go back to run the 
Clojure `burn` again (https://gist.github.com/4683413 in a 1.5.0-RC2 REPL 
started with jvm args of `-Xmx2g -XX:-UseParallelGC`):

1 thread: 38s
2 threads: 32s
4 threads: 23s

Hardly a 2x or 4x improvement, but I had previously obtained badly degrading 
timings corresponding to those in my previous mail.  Microbenchmarking sucks.   
At least I got to twiddle with Racket again.  :-P

Cheers,

- Chas





Re: multicore list processing (was Re: abysmal multicore performance, especially on AMD processors)

2013-01-31 Thread Lee Spector

On Jan 31, 2013, at 10:15 AM, Chas Emerick wrote:
 
 I've not looked at clojush at all; it is AFAICT a complex application in its 
 own right, and I wouldn't know where to start.  My understanding is that the 
 `burn` function is considered representative of the bulk of operations in 
 clojush, but I'm happy to assume that that's not the case.  There's likely a 
 number of things that go into the results of the `burn` benchmark, nevermind 
 the apparent performance of clojush.

FWIW I wrote the burn benchmark because lots of intensive list manipulation 
is at the heart of our real applications, and that seemed to be a nicely 
minimal way to see how that could scale across cores.

Our real applications may indeed raise other issues too, but getting lots of 
intensive list manipulation to scale well is bound to be good.

FWIW the full-application benchmark that Josiah posted was made deterministic 
in a way that means that big chunks of our code won't actually be executing. 
The system generates and interprets lots of Push programs, and if I recall 
correctly the benchmark hardcodes all of the Push programs to be the same 
constant program, meaning that not all of the Push instructions will run, etc. 
It'd be trickier to get the real mix of instructions in a way that makes for 
an easily reproducible benchmark, and this benchmark at least tests the 
basic interpreter and population-level infrastructure. If we can get this to 
scale reasonably then we could try to see how much that helps under 
more completely real conditions.

 -Lee





Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Marshall Bockrath-Vandegrift

Apologies for my very-slow reply here.  I keep thinking that I’ll have
more time to look into this issue, and keep having other things
requiring my attention.  And on top of that, I’ve temporarily lost the
many-way AMD system I was using as a test-bed.

I very much want to see if I can get my hands on an Intel system to
compare to.  My AMD system is in theory 32-way – two physical CPUs, each
with 16 cores.  However, Linux reports (via /proc/cpuinfo) the cores in
groups of 8 (“cpu cores : 8” etc).  And something very strange happens
when extending parallelism beyond 8-way...  I ran several experiments
using a version of your whole-application benchmark I modified to
control the level of parallelism.  At parallelism 9+, the real time it
takes to complete the benchmark hardly budges, but the user/CPU time
increases linearly with the level of parallelism!  As far as I can tell,
multi-processor AMD *is* a NUMA architecture, which might potentially
explain things.  But enabling the JVM NUMA options doesn’t seem to
affect the benchmark.
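
For reference, the flag in question is -XX:+UseNUMA (which mainly benefits the
parallel collector); memory can also be interleaved at the OS level:

java -server -XX:+UseParallelGC -XX:+UseNUMA ...
numactl --interleave=all java -server ...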

I think next steps are two-fold: (1) examine parallelism vs. real & CPU
time on an Intel system, and (2) attempt to reproduce the observed
behavior in pure Java.  I’m keeping my fingers crossed that I’ll have
some time to look at this more soon, but I’m honestly not very hopeful.

In the mean time, I hope you’ve managed to exploit multi-process
parallelism to run more efficiently?

-Marshall





Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Lee Spector

FYI we had a bit of a discussion about this at a meetup in Amherst MA 
yesterday, and while I'm not sufficiently on top of the JVM or system issues to 
have briefed everyone on all of the details, there has been a little followup 
since the discussion, including results of some different experiments by Chas 
Emerick, at: 
http://www.meetup.com/Functional-Programming-Connoisseurs/messages/boards/thread/30946382

 -Lee






Re: abysmal multicore performance, especially on AMD processors

2013-01-30 Thread Andy Fingerhut
Josiah mentioned requesting a free trial of the Zing JVM.  Did you ever get 
access to that, and were you able to try your code running on it?

Again, I have no direct experience with their product to guarantee you better 
results -- just that I've heard good things about their ability to handle 
concurrent workloads with a different GC than most JVMs.

Andy






Re: abysmal multicore performance, especially on AMD processors

2013-01-10 Thread Wm. Josiah Erikson
Am I reading this right that this is actually a Java problem, and not
Clojure-specific? Wouldn't the rest of the Java community have noticed
this? Or maybe massive parallelism in this particular way isn't something
commonly done with Java in the industry?

Thanks for the patches, though - it's nice to see some improvement... I'll
be fascinated to see how this turns out in the end. Have we found a large
Java bug?



Re: abysmal multicore performance, especially on AMD processors

2012-12-30 Thread cameron
I've posted a patch with some changes here 
(https://gist.github.com/4416803); it includes the record change quoted below 
and a small change to interpret-instruction. The benchmark runs ~2x the 
default, as it did for Marshall.
The patch also modifies the main loop to use a thread pool instead of 
agents and allows you to set the number of threads, which might help in 
diagnosing the parallel performance issue.

On the modified benchmark I'm seeing a ~4x speedup with the parallel version 
on an 8-core machine, and the profiler reports that the parallel version is 
using twice as much CPU time.

I also had another look at the native calls issue and modified the Clojure 
runtime to avoid most of the calls the profiler said were taking 
significantly more time in the parallel version. It did speed things up, but 
only by ~6%, not the large margin the profiling results had led me to 
believe was possible; it looks like the profiler overstates these methods' 
times. The modified Clojure 1.5 is available, if anyone's interested, at 
https://github.com/cdorrat/clojure/commit/dfb5f99eb5d0a45165978e079284bab1f25bd79f

YourKit is reporting that a number of clojure.core functions are taking 
longer in the parallel version than in the serial one, and they all seem to be 
methods that have one or more instanceof or instance? calls, but given the 
results above I'm not sure how much weight to give this. 

It seems the elephant is still in the room, responsible for ~50% of the 
CPU time :)

Cameron.

On Saturday, December 22, 2012 10:57:28 AM UTC+11, Marshall 
Bockrath-Vandegrift wrote:

 Lee Spector lspe...@hampshire.edu writes: 

  FWIW I used records for push-states at one point but did not observe a 
  speedup and it required much messier code, so I reverted to 
  struct-maps. But maybe I wasn't doing the right timings. I'm curious 
  about how you changed to records without the messiness. I'll include 
  below my sig the way that I had to do it... maybe you can show me what 
  you did instead. 

 I just double-checked, and I definitely see a 2x speedup on Josiah’s 
 benchmark.  That may still be synthetic, of course.  Here’s what I did: 

 (eval `(defrecord ~'PushState [~'trace ~@(map (comp symbol name) 
 push-types)])) 
 
 (let [empty-state (map->PushState {})] 
   (defn make-push-state 
     "Returns an empty push state." 
     [] empty-state)) 

  Still, I guess the gorilla in the room, which is eating the multicore 
  performance, hasn't yet been found. 

 No, not yet...  I’ve become obsessed with figuring it out though, so 
 still slogging at it. 

 -Marshall 




Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
Hi Lee,
  I've done some more digging and seem to have found the root of the 
problem: it seems that Java native methods are much slower when called in 
parallel.
The following code illustrates the problem:

(letfn [(time-native [f]
          (let [c (class [])]
            (time (dorun (f (fn [_] (.isArray c))
                            (range 1000))))))]
  (println "Sequential Test:")
  (time-native map)
  (println "Parallel Test:")
  (time-native pmap))

On a dual quad-core Xeon box I get the following results:
  Sequential Test:
  "Elapsed time: 1054.807725 msecs"
  Parallel Test:
  "Elapsed time: 15302.605697 msecs"

i.e. the code executed in parallel was ~15 times slower than the sequential 
version.
The same can be seen with the isInstance and isArray members of java.lang.Class.

I'm not sure where to go from here; these functions are frequently used by 
clojure.core.
Perhaps someone with more JVM implementation knowledge can help?

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread Leonardo Borges
In that case isn't context switching dominating your test?

.isArray isn't expensive enough to warrant the use of pmap

Leonardo Borges
www.leonardoborges.com


Re: abysmal multicore performance, especially on AMD processors

2012-12-28 Thread cameron
No, it's not the context switching; changing isArray (a native method) to 
getAnnotations (a normal JVM method) gives the same time for both the 
parallel and serial versions.

Cameron.



-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: abysmal multicore performance, especially on AMD processors

2012-12-24 Thread cameron
I've been moving house for the last week or so but I'll also give the 
benchmark another look.
My initial profiling seemed to show that the parallel version was spending
a significant amount of time in java.lang.Class.isArray.
clojush.pushstate/stack-ref is calling nth on the result of cons; since that
isn't an instance of clojure.lang.Indexed, nth resorts to a series of type
tests (including isArray) before returning the value.

I changed the clojush.pushstate/push-item implementation to

   ;; vec added to ensure the result is indexable
   (assoc state type (vec (cons value (type state))))

This slowed down my single-threaded version a bit but improved my parallel
speedup from 2x to 3x on a machine with 8 physical cores. We could easily
improve this further by replacing the cons with conj and updating the code
that pops the state, or by implementing a more efficient indexable stack
with deftype.
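
For example, a minimal sketch of such a stack (names hypothetical): backed
by a vector with the newest element last, so push is O(1) and nth from the
top stays O(1):

(deftype VecStack [v]
  clojure.lang.Indexed
  (count [_] (count v))
  (nth [_ i] (nth v (- (count v) 1 i))) ; index 0 = top of stack
  (nth [_ i not-found]
    (if (< -1 i (count v))
      (nth v (- (count v) 1 i))
      not-found)))

(defn push-vs [^VecStack s x]
  (VecStack. (conj (.-v s) x)))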

After the change above, clojush.interpreter/execute_instruction is looking
like a hotspot, with clojure.lang.ArraySeq creation seeming to spend more
time in java.lang.Class.getComponentType() in the parallel version than in
the serial one.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-22 Thread Lee Spector

On Dec 21, 2012, at 6:59 PM, Meikel Brandmeyer wrote:
 
 Is there a much simpler way that I overlooked?
 
 I'm not sure it's simpler, but it's more straight-forward, I'd say.
 

Thanks Marshall and Meikel for the struct-to-record conversion code. I'll
definitely make a change along those lines.

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
Wm. Josiah Erikson wmjos...@gmail.com writes:

 I hope this helps people get to the bottom of things.

Not to the bottom of things yet, but found some low-hanging fruit –
switching the `push-state` from a struct-map to a record gives a flat
~2x speedup in all configurations I tested.  So, that’s good?

I have however eliminated to my satisfaction the possibility that this
is something orthogonal to your system.  I do see some speedup
improvements when I lower the level of concurrency to the number of
actual physical cores on my system, but each call to the
`error-function` still takes ~10x as long to complete when run in
parallel vs in serial.

For a while I had some hope that atom-access in the main interpreter
loop was the problem, due to causing extraneous fetches to main memory.
But removing all the atoms from the system didn’t have any appreciable
impact.

Anyway, still poking at it.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Lee Spector

On Dec 21, 2012, at 5:22 PM, Marshall Bockrath-Vandegrift wrote:
 Not to the bottom of things yet, but found some low-hanging fruit –
 switching the `push-state` from a struct-map to a record gives a flat
 ~2x speedup in all configurations I tested.  So, that’s good?

I really appreciate your attention to this Marshall.

FWIW I used records for push-states at one point but did not observe a speedup 
and it required much messier code, so I reverted to struct-maps. But maybe I 
wasn't doing the right timings. I'm curious about how you changed to records 
without the messiness. I'll include below my sig the way that I had to do it... 
maybe you can show me what you did instead.

Still, I guess the gorilla in the room, which is eating the multicore 
performance, hasn't yet been found.

 -Lee

--- tl;dr: the structs vs. records thing ---

Here's how I do what I need to do with struct-maps:

;; this is defined elsewhere, and I want push-states to have fields for each
;; push-type that's defined here
(def push-types '(:exec :integer :float :code :boolean :string :zip
                  :tag :auxiliary :return :environment))

;; so I use a macro to produce the struct-map definition
(defmacro define-push-state-structure []
  `(defstruct push-state ~@push-types))

;; call the macro to define the struct
(define-push-state-structure)

;; and then when I want a new push-state I can just call struct-map:
(defn make-push-state
  "Returns an empty push state."
  []
  (struct-map push-state))

In order to do the same thing with records I had to resort to this:

(defn keyword-symbol
  "Returns the symbol obtained by removing the : from a keyword."
  [kwd]
  (read-string (name kwd)))

(defmacro define-push-state-record-type
  "Defines the pushstate record type. The odd trick with read-string was a
  hack to avoid namespace qualification on the pushstate symbol."
  []
  `(defrecord ~(read-string "pushstate") [~@(map keyword-symbol push-types)]))

(define-push-state-record-type)

(defmacro make-push-state
  "Returns an empty push state."
  []
  `(pushstate. ~@(map (fn [_] nil) push-types)))

Is there a much simpler way that I overlooked?






Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes:

 FWIW I used records for push-states at one point but did not observe a
 speedup and it required much messier code, so I reverted to
 struct-maps. But maybe I wasn't doing the right timings. I'm curious
 about how you changed to records without the messiness. I'll include
 below my sig the way that I had to do it... maybe you can show me what
 you did instead.

I just double-checked, and I definitely see a 2x speedup on Josiah’s
benchmark.  That may still be synthetic, of course.  Here’s what I did:

(eval `(defrecord ~'PushState [~'trace ~@(map (comp symbol name)
                                              push-types)]))

(let [empty-state (map->PushState {})]
  (defn make-push-state
    "Returns an empty push state."
    [] empty-state))

 Still, I guess the gorilla in the room, which is eating the multicore
 performance, hasn't yet been found.

No, not yet...  I’ve become obsessed with figuring it out though, so
still slogging at it.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-21 Thread Meikel Brandmeyer
Hi,

On 22.12.12 00:37, Lee Spector wrote:

 ;; this is defined elsewhere, and I want push-states to have fields for each
 ;; push-type that's defined here
 (def push-types '(:exec :integer :float :code :boolean :string :zip
                   :tag :auxiliary :return :environment))

 (defn keyword-symbol
   "Returns the symbol obtained by removing the : from a keyword."
   [kwd]
   (read-string (name kwd)))

I'm afraid this one is still necessary, but I would use symbol instead
of read-string.
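
That is (sketch):

(defn keyword-symbol
  "Returns the symbol obtained by removing the : from a keyword."
  [kwd]
  (symbol (name kwd)))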

 (defmacro define-push-state-record-type
   "Defines the pushstate record type. The odd trick with read-string was a
   hack to avoid namespace qualification on the pushstate symbol."
   []
   `(defrecord ~(read-string "pushstate") [~@(map keyword-symbol push-types)]))

Use ~'pushstate instead of read-string.
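
I.e. (sketch):

(defmacro define-push-state-record-type []
  `(defrecord ~'pushstate [~@(map keyword-symbol push-types)]))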

 (define-push-state-record-type)

 (defmacro make-push-state
   "Returns an empty push state."
   []
   `(pushstate. ~@(map (fn [_] nil) push-types)))

Use this:

(defn make-push-state
  []
  (map->pushstate {}))

 Is there a much simpler way that I overlooked?

I'm not sure it's simpler, but it's more straight-forward, I'd say.

Kind regards
Meikel




Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
So here's what we came up with that clearly demonstrates the problem. Lee
provided the code and I tweaked it until I believe it shows the problem
clearly and succinctly.

I have put together a .tar.gz file that has everything needed to run it,
except lein. Grab it here: clojush_bowling_benchmark.tar.gz

Then run, for instance: /usr/bin/time -f %E lein run
clojush.examples.benchmark-bowling

and then, when that has finished, edit
src/clojush/examples/benchmark_bowling.clj and uncomment
:use-single-thread true and run it again. I think this is a succinct,
deterministic benchmark that clearly demonstrates the problem and also
doesn't use conj or reverse. We don't see slowdowns, but I cannot get any
better than around 2x speedup on any hardware with this benchmark.

I hope this helps people get to the bottom of things.

-Josiah



Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
Whoops, sorry about the link. It can be found here:
http://gibson.hampshire.edu/~josiah/clojush/


Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Lee Spector

On Dec 19, 2012, at 11:57 AM, Wm. Josiah Erikson wrote:
  I think this is a succinct, deterministic benchmark that clearly 
 demonstrates the problem and also doesn't use conj or reverse. 

Clarification: it's not just a tight loop involving reverse/conj, as our
previous benchmark was. It's our real application, but with deterministic
versions of all of the random functions; and while the project includes some
calls to reverse and conj, I don't think they're playing a big role here.

Almost all of the time here is spent evaluating a Push program that just does
a lot of integer addition and consing. As far as I can tell all of the
consing is done with cons explicitly and not conj, although maybe I'm missing
something; and I'm saying this only from looking at our code, not the Clojure
libraries.

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Wm. Josiah Erikson
I tried redefining the few places in the code (string_reverse, I think) that
used reverse to use the same version of reverse that gave such great speedups
in your code, and it made no difference. There are no explicit calls to conj
in the code that I could find.


Re: abysmal multicore performance, especially on AMD processors

2012-12-19 Thread Tassilo Horn

FWIW, I've just run it, with these results:

- a dual-core intel notebook:
  + single-threaded: 4:00.09
  + multi-threaded:  2:27.35

- an AMD Opteron 8-core machine [*]
  + single-threaded: 3:03.51
  + multi-threaded:  1:31.58

So indeed, I also get only a speedup of factor 2.

[*] I'm not exactly sure what kind of machine that is.  /proc/cpuinfo
reports 8 processors, each being a Quad-Core AMD Opteron(tm) Processor
8387.  That would make 32 cores, but htop shows just 8.  I think it's a
virtualized machine, so maybe the host has 8 quad-cores but the VM gets
only 8 of them...

Bye,
Tassilo



Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector

On Dec 15, 2012, at 1:14 AM, cameron wrote:
 
 Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java 
 for my GP work but for the last few years it's been GEVA with a clojure 
 wrapper I wrote (https://github.com/cdorrat/geva-clj).

Ah yes -- I've actually downloaded and played a bit with your GEVA via Clojure. 
I've thought it would be interesting to try using GEVA to evolve Push programs 
(which have a trivial grammar), but haven't had the time to try that.

I've also used ECJ for some projects but haven't looked at its multicore 
performance. Sean Luke (ECJ's author and incidentally a former student of mine) 
even hacked together a way to use it to evolve Push programs but it was nasty 
(generating Push programs as strings) and would be low-performance for other 
reasons.

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-16 Thread Lee Spector

On Dec 14, 2012, at 10:41 PM, cameron wrote:
 Until Lee has a representative benchmark for his application it's difficult 
 to tell if he's
 experiencing the same problem but there would seem to be a case for changing 
 the PersistentList
 implementation in clojure.lang.

We put together a version of our application in which we just replaced all of 
the random calls with deterministic functions (returning either constants or 
deterministic functions of their arguments). 

What we saw was a maximum speedup of less than x2 on a 48 core machine, which 
was interesting in part because the determinism meant that only a tiny subset 
of our code was being executed (because all of the Push programs in the 
population were identical and used only a single Push instruction). So I think 
that this may indeed help us to hone in on the problem.

Before we share that code, however, we should make sure that the evaluations 
that are being done concurrently are sufficiently long-running, and maybe tweak 
a couple of other things. I think we'll have a chance to do that early in the 
week and we'll share the results/code when we do.

Thanks,

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread Herwig Hochleitner
I've created a test harness for this as a leiningen plugin:
https://github.com/bendlas/lein-partest

You can just put

:plugins [[net.bendlas/lein-partest "0.1.0"]]

into your project and run

lein partest your.ns/testfn 6

to run 6 threads/processes in parallel

The plugin then runs the function as follows:

- (a default of) 2 warmup runs
- 6 times serially
- 6 times in parallel with futures
- 6 times in parallel with classloaders
- 6 times in parallel with processes

A run with the original function (with an i of 1000 instead of 1) on my
machine reveals that running in separate classloaders yields roughly the
same times.
This hints at the bottleneck being in the JVM instead of the Clojure
runtime, provided I have used the classloader feature of Leiningen
correctly (please verify).
I'd be very interested in the performance on Azul's JVM.

 lein partest example/burn 4
# Run serial ... warmup ...
  #  0, elapsed:   187.57 msecs
  #  1, elapsed:   184.83 msecs
  #  2, elapsed:   183.99 msecs
  #  3, elapsed:   184.33 msecs
# Run parallel, same classloader ... warmup ...
  #  1, elapsed:  1063.02 msecs
  #  2, elapsed:  1071.12 msecs
  #  3, elapsed:  1091.50 msecs
  #  0, elapsed:  1100.42 msecs
# Run parallel, distinct classloaders ... warmup ...
 warmup ...
 warmup ...
 warmup ...
  #  2, elapsed:  1135.74 msecs
  #  3, elapsed:  1118.14 msecs
  #  1, elapsed:  1119.45 msecs
  #  0, elapsed:  1103.45 msecs
# Run parallel, distinct processes ... warmup ...
 warmup ...
 warmup ...
 warmup ...
  #  0, elapsed:   191.96 msecs
  #  2, elapsed:   209.49 msecs
  #  3, elapsed:   191.99 msecs
  #  1, elapsed:   201.93 msecs


Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron
Thanks Herwig,
  I used your plugin with the following 2 burn variants:

(defn burn-slow [& _]
  (count (last (take 1000 (iterate #(reduce conj '() %) (range 1))))))

(defn burn-fast [& _]
  (count (last (take 1000 (iterate #(reduce conj* (list nil) %) (range 1))))))

Where conj* is just a copy of clojure.core/conj. In the first case I see the
same behaviour you saw: fast serially and in separate processes (~115 ms) but
slow in parallel for both the same and different class loaders (708ms -
950ms).
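
For reference, a sketch of that copy (mirroring the core definition, which
delegates to clojure.lang.RT):

(defn conj*
  ([coll x] (clojure.lang.RT/conj coll x))
  ([coll x & xs]
   (if xs
     (recur (clojure.lang.RT/conj coll x) (first xs) (next xs))
     (clojure.lang.RT/conj coll x))))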

In the second case, where I avoid any references to
clojure.lang.PersistentList$EmptyList and use a clean copy of conj, I see
much better performance in all cases (~40ms - 50ms) and no slowdown in the
parallel cases in the same JVM.

Until Lee has a representative benchmark for his application it's difficult
to tell if he's experiencing the same problem, but there would seem to be a
case for changing the PersistentList implementation in clojure.lang.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-14 Thread cameron

 I'd be interested in seeing your GP system. The one we're using evolves 
 Push programs and I suspect that whatever's triggering this problem with 
 multicore utilization is stemming from something in the inner loop of my 
 Push interpreter (https://github.com/lspector/Clojush)... but I don't 
 know exactly what it is. 


Originally I was using ECJ (http://cs.gmu.edu/~eclab/projects/ecj/) in java 
for my GP work but for the last few years it's been GEVA with a clojure 
wrapper I wrote (https://github.com/cdorrat/geva-clj).


  I realize that I could go back to this sort of thing, or something more 
 modern and reasonable like hadoop, but of course it'd be a lot nicer if 
 parallelization via agents (or some other mechanism central to the 
 language) just didn't have whatever pathology we've uncovered. 


Agreed. I'd like Clojure to be my preferred language when I have a 100-core
machine; it's certainly something I'd like to get to the bottom of.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
OK, I did something a little bit different, but I think it proves the same
thing we were shooting for.

On a 48-way 4 x Opteron 6168 with 32GB of RAM, running Tom's bowling
benchmark:

1. multithreaded: average of 10 runs: 14:00.9
2. singlethreaded: average of 10 runs: 23:35.3
3. singlethreaded, 8 simultaneous copies: average run time of said
concurrently running copies: 23:31.5

So we see a speedup of less than 2x running multithreaded in a single JVM
instance. By contrast, running 8 simultaneous copies in 8 separate JVMs
gives us a perfect 8x speedup over running a single instance of the same
singlethreaded benchmark. This proves pretty conclusively that it's not a
hardware limitation, it seems to me, unless the problem is that it's trying
to spawn 48 threads and that creates contention.

I don't think so though, because on an 8-way FX-8120 with 16GB of RAM, we
see a very similar lack of speedup going from singlethreaded to
multithreaded (and it will only be trying to use 8 threads, right?), and
then we see a much better speedup (around 4x; we're doing 8 times the work
in twice the amount of time) going to 8 concurrent copies of the same thing
in separate JVMs (even though I had to decrease RAM usage on the 8
concurrent copies to avoid swapping, thereby possibly slowing this down a
bit):
1. 9:00.6
2. 14:15.6
3. 27:35.1

We're probably getting a better speedup with the concurrent copies on the
48-way node because of higher memory bandwidth, bigger caches (and more of
them), and more memory.

Does this help? Should I do something else as well? I'm curious to try
running, say, 16 concurrent copies on the 48-way node.


-Josiah


Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Ah. We'll look into running several clojures in one JVM too. Thanks.


Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Andy Fingerhut
I'm not saying that I know this will help, but if you are open to trying a 
different JVM that has had a lot of work done on it to optimize it for high 
concurrency, Azul's Zing JVM may be worth a try, to see if it increases 
parallelism for a single Clojure instance in a single JVM, with lots of threads.

It costs $$, but I'm guessing they might have steep discounts for educational 
institutions.  They have free trials, too.

http://www.azulsystems.com/products/zing/whatisit

Andy


Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Wm. Josiah Erikson
Cool. I've requested a free trial.


Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread cameron


On Friday, December 14, 2012 5:41:59 AM UTC+11, Wm. Josiah Erikson wrote:


 Does this help? Should I do something else as well? I'm curious to try 
 running like, say 16 concurrent copies on the 48-way node 

Have you made any progress on a small deterministic benchmark that reflects
your application's behaviour (i.e. the RNG seed work you were discussing)?
I'm keen to help, but I don't have time to look at benchmarks that take
hours to run.

I've also been using Clojure for genetic programming and have been getting
very good parallel speedups on a single JVM. This type of problem should see
excellent performance gains: it has many medium-sized, CPU-intensive
sub-problems that require no shared resources. I used Hadoop to scale beyond
single machines; in the worst case you can use this approach, and you should
see speed-ups equivalent to your multiple-JVM test. I just split the
population into small blocks and had each map function calculate the fitness
and return a map of individual-id -> fitness.
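
A minimal sketch of that scheme (fitness-of and :id are hypothetical
stand-ins; in the real setup each block is one Hadoop map task rather than a
pmap thread):

(defn evaluate-blocks [population block-size fitness-of]
  (->> (partition-all block-size population)
       (pmap (fn [block]
               (into {} (for [ind block]
                          [(:id ind) (fitness-of ind)]))))
       (apply merge)))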

Cameron.



Re: abysmal multicore performance, especially on AMD processors

2012-12-13 Thread Lee Spector


I don't think we've quite worked out a deterministic test, but I'll look into 
it some more.

I'd be interested in seeing your GP system. The one we're using evolves Push 
programs and I suspect that whatever's triggering this problem with multicore 
utilization is stemming from something in the inner loop of my Push interpreter 
(https://github.com/lspector/Clojush)... but I don't know exactly what it is.

I've never used hadoop, but I've done lots of coarse-grained parallel GP using 
(gasp) shell scripts and the file system to glue the independent parts 
together. I realize that I could go back to this sort of thing, or something 
more modern and reasonable like hadoop, but of course it'd be a lot nicer if 
parallelization via agents (or some other mechanism central to the language) 
just didn't have whatever pathology we've uncovered.

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron

Hi Marshall,
  the megamorphic call site hypothesis does sound plausible, but I'm not
sure where the following test fits in.
If I understand correctly, we believe that the base case (a
PersistentList$EmptyList instance) and the normal case (a PersistentList
instance) have different types, and when run in parallel the interleaving
invocations cause problems for the JIT, whereas when we use a vector it's a
single type, so we see a better speedup.

I was toying with the idea of replacing the EmptyList class with a
PersistentList instance to mitigate the problem in at least one common case;
however, it doesn't seem to help.
If I replace the reverse call in burn with the following code:
  #(reduce conj (list nil) %)
I get the same slowdown as we see with reverse (equivalent to
#(reduce conj '() %))

where:
  (class '())        = clojure.lang.PersistentList$EmptyList
  (class (list nil)) = clojure.lang.PersistentList

I'm not convinced we're at the bottom of it yet; perhaps the earlier post on
memory contention by Carlos might yield some clues. If the issue were
related to memory access, it might help explain why the impact differed
significantly between Intel and AMD hardware.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut andy.finger...@gmail.com writes:

 I'm not practiced in recognizing megamorphic call sites, so I could be
 missing some in the example code below, modified from Lee's original
 code.  It doesn't use reverse or conj, and as far as I can tell
 doesn't use PersistentList, either, only Cons.

...

 Can you try to reproduce to see if you get similar results?  If so, do
 you know why we get bad parallelism in a single JVM for this code?  If
 there are no megamorphic call sites, then it is examples like this
 that lead me to wonder about locking in memory allocation and/or GC.

I think your benchmark is a bit different from Lee’s original.  The
`reverse`-based versions perform heavy allocation as they repeatedly
reverse a sequence, but each thread holds a sequence of length at
most 10,000 at any given time.  In your benchmark, each thread holds a
sequence of at most 2,000,000 elements, for a naive 200x increase in
memory pressure and a potential increase in the number of objects being
promoted out of the young generation.

I ran your benchmark under a version of Cameron’s criterium-based
speed-up measurement wrapper I’ve modified to pass in the `pmap`
function to use.  I reduced the number of iterations in your algorithm
by a factor of 5 to get it to run in a reasonable amount of time.  And I
ran it using default JVM GC settings, on a 32-way AMD system.

I get the following numbers for 1-32 way parallelism with a 500MB heap:

andy  1 : smap-ms 7.5, pmap-ms 7.7, speedup 0.97
andy  2 : smap-ms 7.8, pmap-ms 9.8, speedup 0.80
andy  4 : smap-ms 8.5, pmap-ms 10.6, speedup 0.80
andy  8 : smap-ms 8.6, pmap-ms 11.5, speedup 0.75
andy 16 : smap-ms 8.1, pmap-ms 12.5, speedup 0.65
andy 32 : [java.lang.OutOfMemoryError: Java heap space]

And these numbers with a 4GB heap:

andy  1 : smap-ms 3.8, pmap-ms 4.0, speedup 0.95
andy  2 : smap-ms 4.2, pmap-ms 2.1, speedup 2.02
andy  4 : smap-ms 4.2, pmap-ms 1.7, speedup 2.48
andy  8 : smap-ms 4.2, pmap-ms 1.2, speedup 3.44
andy 16 : smap-ms 4.4, pmap-ms 1.0, speedup 4.52
andy 32 : smap-ms 4.0, pmap-ms 1.6, speedup 2.55

I’m running out of time for breakfast experiments, but it seems
relatively likely to me that the increased at-once sequence size in your
benchmark is increasing the number of objects making it out of the young
generation.  This in turn is increasing the number of pause-the-world
GCs, which increase even further in frequency at lower heap sizes.  I’ll
run these again later with GC logging and report if the results are
unexpected.
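
(For reference, the GC logging in question can be enabled with HotSpot's
standard flags; a sketch, assuming a Leiningen project and the JDK 6/7-era
options:)

  :jvm-opts ["-Xmx4g" "-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps"]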

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes:

   the megamorphic call site hypothesis does sound plausible but I'm
 not sure where the following test fits in.

...

 I was toying with the idea of replacing the EmptyList class with a
 PersistentList instance to mitigate the problem
 in at least one common case, however it doesn't seem to help.
 If I replace the reverse call in burn with the following code:
   #(reduce conj (list nil) %)
 I get the same slowdown as we see with reverse (equivalent to #(reduce
 conj '() %))

Ah, but include your own copy of `conj` and try those two cases.  The
existing clojure.core/conj has already been used on multiple types, so
you need a new IFn class with a fresh call site.  Here are the numbers I
get when I do that:

w/o EmptyList : smap-ms 6.1, pmap-ms 1.2, speedup 5.26
w/  EmptyList : smap-ms 10.4, pmap-ms 16.2, speedup 0.64
w/o EmptyList : smap-ms 10.5, pmap-ms 16.3, speedup 0.64
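
(A fresh-call-site copy of `conj` can be as small as the following sketch;
the `conj*` with added busy-work elsewhere in this thread is a variation on
the same idea:)

(defn conj*
  [coll x]
  (clojure.lang.RT/conj coll x))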

That said, I’m slightly less convinced than I was earlier.  I’m having
difficulty producing a minimal example demonstrating the issue, and the
results wmjosiah reported for modifying their actual code are
disheartening.  I just don’t have anything else which begins to explain
the transition from speedup to inverse speedup in the above benchmarking
sequence.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Andy Fingerhut
Lee:

I believe you said that your benchmarking code achieved good speedup when 
run as separate JVMs that were each running a single thread, even before making 
the changes to the implementation of reverse found by Marshall.  I confirmed 
that on my own machine as well.

Have you tried running your real application in a single thread in a JVM, and 
then running multiple JVMs in parallel, to see if there is any speedup?  If so, 
that would again help determine whether it is multiple threads in a single JVM 
causing the slowdown, or something to do with the hardware or OS that is the 
limiting factor.

Andy


On Dec 11, 2012, at 4:37 PM, Lee Spector wrote:

 
 On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:
 So I think if you replace your calls to `reverse` and any `conj` loops
 you have in your own code, you should see a perfectly reasonable
 speedup.
 
 Tantalizing, but on investigation I see that our real application actually 
 does very little explicitly with reverse or conj, and I don't actually think 
 that we're getting reasonable speedups (which is what led me to try that 
 benchmark). So while I'm not sure of the source of the problem in our 
 application I think there can be a problem even if one avoids direct calls to 
 reverse and conj. Andy's recent tests also seem to confirm this.
 
 BTW benchmarking our real application (https://github.com/lspector/Clojush) 
 is a bit tricky because it's riddled with random number generator calls that 
 can have big effects, but we're going to look into working around that. 
 Recent postings re: seedable RNGs may help, although changing all of the RNG 
 code may be a little involved because we use thread-local RNGs (to avoid 
 contention and get good multicore speedups... we thought!).
 
 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector

On Dec 12, 2012, at 10:03 AM, Andy Fingerhut wrote:
 
 Have you tried running your real application in a single thread in a JVM, and 
 then running multiple JVMs in parallel, to see if there is any speedup?  If so, 
 that would again help determine whether it is multiple threads in a single 
 JVM causing the slowdown, or something to do with the hardware or OS that is 
 the limiting factor.

I don't think we've tried this exact test but we will now. Because we haven't 
yet engineered all of the randomness out of the application we'll do a bunch of 
runs in each condition and average the times. Based on recent tests I think 
this will give reasonably stable results if we do 10 runs or so in each 
condition. I'll keep you posted.

Thanks!

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
Lee, while you are at benchmarking, would you mind running several threads
in one JVM with one clojure instance per thread? Thus each thread should
get JITted independently.

Christophe






-- 
On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/


Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Lee Spector

On Dec 12, 2012, at 10:45 AM, Christophe Grand wrote:
 Lee, while you are at benchmarking, would you mind running several threads in 
 one JVM with one clojure instance per thread? Thus each thread should get 
 JITted independently.

I'm not actually sure how to do that. We're starting runs with `lein run`, 
which, if I understand correctly, launches a JVM and a Clojure instance within 
it. Within that Clojure instance we can do things in either a single-threaded 
or a multi-threaded way. How would I change our method to do what you're asking 
here? We're not particularly Java savvy (I come from the Lisp side of things).

Thanks,

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread cameron


On Thursday, December 13, 2012 12:51:57 AM UTC+11, Marshall 
Bockrath-Vandegrift wrote:

 cameron cdo...@gmail.com writes: 

the megamorphic call site hypothesis does sound plausible but I'm 
  not sure where the following test fits in. 

 ... 

  I was toying with the idea of replacing the EmptyList class with a 
  PersistentList instance to mitigate the problem 
  in at least one common case, however it doesn't seem to help. 
  If I replace the reverse call in burn with the following code: 
    #(reduce conj (list nil) %) 
  I get the same slowdown as we see with reverse (equivalent to #(reduce 
  conj '() %)) 

 Ah, but include your own copy of `conj` and try those two cases.  The 
 existing clojure.core/conj has already been used on multiple types, so 
 you need a new IFn class with a fresh call site.  Here are the numbers I 
 get when I do that: 

 
Ah, I've also been looking at this in the morning and missed that bit.
When I use a copy of conj I get similar results to yours:
  w/ list-base : map-ms: 4.0, pmap-ms 0.6, speedup 7.02
  w/ empty-list-base : map-ms: 9.9, pmap-ms 20.4, speedup 0.48
  w/ list-base : map-ms: 9.4, pmap-ms 20.7, speedup 0.45
  w/ vector : map-ms: 4.5, pmap-ms 1.4, speedup 3.28

It does seem that once used on the empty-list case, conj gets tainted and 
all future list uses incur a performance penalty. On that basis it would 
seem reasonable to convert the EmptyList uses in clojure.lang to a 
PersistentList instance.

Interestingly, using a tainted conj with other types doesn't seem to incur 
the penalty (see the vector entry at the end of the times above).

If I get time I might look at a protocol-based version of conj, out of 
interest.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-12 Thread Christophe Grand
See https://github.com/flatland/classlojure for a nearly ready-made
solution to running several Clojure instances in one JVM.
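
(A minimal sketch of classlojure usage, assuming the API from its README;
the jar path is illustrative:)

(use 'classlojure.core)

(def cl (classlojure "file:clojure-1.4.0.jar"))
(eval-in cl '(+ 1 2))  ;=> 3, evaluated in its own Clojure runtime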






-- 
On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/


Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
nicolas.o...@gmail.com writes:

 What happens if your run it a third time at the end?   (The question
 is related to the fact that there appears to be transition states
 between monomorphic and megamorphic call sites,  which might lead to
 an explanation.)

Same results, but your comment jogged my reading and analysis in what I
believe is the right direction.

This is a potentially grandiose claim, but I believe the inverse speedup
is due to contention caused by JVM type profiling in megamorphic call
sites.  Unfortunately -XX:-UseTypeProfile doesn’t appear to turn off
type profiling itself, so I can’t prove this claim definitively without
going even further down the rabbit hole and grubbing through the JVM
source code.

To back up this claim, I present the following modified `conj*`:

(defn conj*
  [coll x]
  (let [a (long-array 32)]
    (dotimes [n 5] (dotimes [i 32] (aset a i n)))
    (clojure.lang.RT/conj coll x)))

And the resulting profiling numbers:

list-conj* : map-ms: 42.0, pmap-ms 24.8, speedup 1.69
cons-conj* : map-ms: 39.7, pmap-ms 25.1, speedup 1.58

Adding busy-work (and an extra allocation!) to the loop improved the
speed-up, I believe by decreasing the relative portion of the call
execution time during which the callsite type profile information is
being updated.  If I’m correct, the `Cons` and `PersistentList` `.cons`
implementations are so tight that in their base versions the type
profiling forms a significant enough portion of the total call that the
updates are in frequent conflict.  Adding busy-work to the `conj*` call
adds some jitter which prevents them from contending quite as
frequently.

I’m not sure what the next steps are.  Open a bug on the JVM?  This is
something one can attempt to circumvent on a case-by-case basis, but
IMHO it has significant negative implications for Clojure’s concurrency
story.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector
On Dec 11, 2012, at 4:37 AM, Marshall Bockrath-Vandegrift wrote:
 I’m not sure what the next steps are.  Open a bug on the JVM?  This is
 something one can attempt to circumvent on a case-by-case basis, but
 IMHO it has significant negative implications for Clojure’s concurrency
 story.

I've gotten a bit lost in some of the diagnoses and experimental results and 
analyses and suggestions -- all of which I really appreciate! -- but I'm trying 
to get clear in my mind what the current state of things is from an application 
perspective.

Is the following a fair characterization pending further developments?

---
If you have a cons-intensive task then even if it can be divided into 
completely independent, long-running subtasks, there is currently no known way 
to get significant speedups by running the subtasks on multiple cores within a 
single Clojure process. In some cases you will be able to get significant 
speedups by separating the subtasks completely and running them in separate 
Clojure processes running on separate JVM instances. But the speedups will be 
lost (mostly, and you might even experience slowdowns) if you try to run them 
from within a single Clojure process.
---

Or have I missed a currently-available work-around among the many suggestions?

I realize that even if this is the current situation it might change via fixes 
on any one of several fronts (and I hope that it does!), but is this indeed 
the current situation?

Thanks so much,

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Gary Johnson
Lee,

  My reading of this thread is not quite as pessimistic as yours. Here is 
my synthesis for the practical application developer in Clojure from 
reading and re-reading all of the posts above. Marshall and Cameron, please 
feel free to correct me if I screw anything up here royally. ;-)

When running code in parallel, the fastest to slowest ways to allocate 
memory via Clojure (of those examined above) are:
1. Java array
2. Java linked list
3. conj onto a transient vector
4. cons onto a list
5. conj onto a vector
6. DO NOT conj onto a list EVER!!!

And that's pretty much all there is to it until Marshall figures out how to 
hack the JVM to overcome this very subtle JIT deoptimization bug relating 
to polymorphic conj calls on Cons and PersistentList types.
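
(To make options 3 and 4 concrete, here are minimal sketches; the names are
illustrative:)

(defn build-vec [n]
  ;; option 3: conj onto a transient vector
  (loop [i 0, v (transient [])]
    (if (== i n)
      (persistent! v)
      (recur (inc i) (conj! v i)))))

(defn build-list [n]
  ;; option 4: cons onto a list (a monomorphic call site)
  (loop [i 0, l nil]
    (if (== i n)
      l
      (recur (inc i) (cons i l)))))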

  Good luck,
~Gary




Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes:

 Is the following a fair characterization pending further developments?

 If you have a cons-intensive task then even if it can be divided into
 completely independent, long-running subtasks, there is currently no
 known way to get significant speedups by running the subtasks on
 multiple cores within a single Clojure process. 

Not quite.  If you’d been using `cons` (in the benchmark, if `reverse`
used `cons` in its implementation), then you’d be getting a perfectly
reasonable speedup.  The problem child in this instance is `conj`.

If my analysis is correct, then the issue is any megamorphic call site –
such as `conj` – which is invoked in a tight loop by multiple threads
simultaneously.  Any simultaneous invocation of such call sites
introduces contention and reduces speedup, but the problem only becomes
pathological in very, very tight loops, such as when performing the
minimal work required by the `.cons` [1] implementations of `Cons` and
`PersistentList`.  In these cases the portion of the call which
introduces contention is a sufficient proportion of the overall call
time that the speedup becomes inverse.

 In some cases you will be able to get significant speedups by
 separating the subtasks completely and running them in separate
 Clojure processes running on separate JVM instances.  But the speedups
 will be lost (mostly, and you might even experience slowdowns) if you
 try to run them from within a single Clojure process.

For this particular issue, splitting each task into a separate JVM
entirely negates the problem, because there is no simultaneous
invocation of the same call site.

 Or have I missed a currently-available work-around among the many
 suggestions?

You can specialize your application to avoid megamorphic call sites in
tight loops.  If you are working with `Cons`-order sequences, just use
`cons` instead of `conj`.  If you are working with vectors, create your
own private implementation of `conj` which you *only* call on vectors.
If you are depending on operations which may/do use `conj` in tight
loops, create your own private re-implementations which don’t, such as
with any of the faster versions of `reverse` earlier in this thread.

This is suboptimal, but it’s totally possible to work around the issue
with a little bit of analysis and profiling.
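
(For example, a vector-only `conj` can be a one-liner; a sketch, assuming
callers only ever pass vectors so the call site stays monomorphic:)

(defn vconj [^clojure.lang.IPersistentVector v x]
  (.cons v x))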

[1] Possible point of confusion – the JVM interface method invoked by
the Clojure `conj` function is named `.cons`, for I assume historical
reasons.  The Clojure `cons` function on the other hand just allocates a
`Cons` object in an entirely monomorphic fashion.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector

On Dec 11, 2012, at 11:40 AM, Marshall Bockrath-Vandegrift wrote:
 
 Or have I missed a currently-available work-around among the many
 suggestions?
 
  You can specialize your application to avoid megamorphic call sites in
 tight loops.  If you are working with `Cons`-order sequences, just use
 `cons` instead of `conj`.  If you are working with vectors, create your
 own private implementation of `conj` which you *only* call on vectors.
 If you are depending on operations which may/do use `conj` in tight
 loops, create your own private re-implementations which don’t, such as
 with any of the faster versions of `reverse` earlier in this thread.
 
 This is suboptimal, but it’s totally possible to work around the issue
 with a little bit of analysis and profiling.

Very informative. Thanks and thanks also to Gary.

If the application does lots of list processing but does so with a mix of 
Clojure list and sequence manipulation functions, then one would have to write 
private, list/cons-only versions of all of these things? That is -- overstating 
it a bit, to be sure, but perhaps not entirely unfairly -- re-implement 
Clojure's Lisp? 

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes:

 If the application does lots of list processing but does so with a
 mix of Clojure list and sequence manipulation functions, then one
 would have to write private, list/cons-only versions of all of these
 things? That is -- overstating it a bit, to be sure, but perhaps not
 entirely unfairly -- re-implement Clojure's Lisp?

I just did a quick look over clojure/core.clj, and `reverse` is the only
function which stood out to me as hitting the most pathological case.
Every other `conj` loop over a user-provided datastructure is `conj`ing
into an explicit non-list/`Cons` type.

So I think if you replace your calls to `reverse` and any `conj` loops
you have in your own code, you should see a perfectly reasonable
speedup.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
OK WOW. You hit the nail on the head. It's reverse being called in a pmap
that does it. When I redefine my own version of reverse (I totally cheated
and just stole this) like this:

(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))



I sped it up from, get this, 4:32 to 0:25. Yeah.

In case anybody's curious, pmap and pmapall show identical performance in
this case.

Wow.

-Josiah




Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
And, interestingly enough, suddenly the AMD FX-8350 beats the Intel Core i7
3770K, when before it was very very much not so. So for some reason, this
bug was tickled more dramatically on AMD multicore processors than on Intel
ones.






Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Andy Fingerhut
Marshall:

I'm not practiced in recognizing megamorphic call sites, so I could be missing 
some in the example code below, modified from Lee's original code.  It doesn't 
use reverse or conj, and as far as I can tell doesn't use PersistentList, 
either, only Cons.

(defn burn-cons [size]
  (let [size (long size)]
    (loop [i (long 0)
           value nil]
      (if (= i size)
        (last value)
        (recur (inc i) (clojure.lang.Cons.
                         (* (int i)
                            (+ (float i)
                               (- (int i)
                                  (/ (float i)
                                     (inc (int i))))))
                         value))))))

(a) invoke (burn-cons 200) sequentially 64 times in a single JVM

(b) invoke (burn-cons 200) 64 times using a modified version of pmap that 
limits the number of active threads to 2 (see below), in a single JVM.  I would 
hope that this would take about half the elapsed time of (a), but the elapsed 
time is longer than (a)

(c) start up two JVMs simultaneously and invoke (burn-cons 200) 
sequentially 32 times in each.  The elapsed time here is less than (a), as I 
would expect.

(Clojure 1.4, Oracle/Apple JDK 1.6.0_37, Mac OS X 10.6.8, running on a machine 
with an Intel Core i7 with 4 physical cores, though OS X reports it as 8, I 
think because of 2 hyperthreads per core -- more details available on request).

Can you try to reproduce to see if you get similar results?  If so, do you know 
why we get bad parallelism in a single JVM for this code?  If there are no 
megamorphic call sites, then it is examples like this that lead me to wonder 
about locking in memory allocation and/or GC.


With the functions below, my part (b) was measured by doing:

(time (doall (nthreads-pmap 2 (fn [_] (burn-cons 200)) (unchunk (range 64)))))

Andy



(defn unchunk [s]
  (when (seq s)
    (lazy-seq
     (cons (first s)
           (unchunk (next s))))))

(defn nthreads-pmap
  "Like pmap, except can take an argument nthreads to control the
  maximum number of parallel threads used."
  ([f coll]
     (let [n (+ 2 (.. Runtime getRuntime availableProcessors))]
       (nthreads-pmap n f coll)))
  ([nthreads f coll]
     (if (= nthreads 1)
       (map f coll)
       (let [n (dec nthreads)
             rets (map #(future (f %)) coll)
             step (fn step [[x & xs :as vs] fs]
                    (lazy-seq
                     (if-let [s (seq fs)]
                       (cons (deref x) (step xs (rest s)))
                       (map deref vs))))]
         (step rets (drop n rets)))))
  ([nthreads f coll & colls]
     (let [step (fn step [cs]
                  (lazy-seq
                   (let [ss (map seq cs)]
                     (when (every? identity ss)
                       (cons (map first ss) (step (map rest ss)))))))]
       (nthreads-pmap nthreads #(apply f %) (step (cons coll colls))))))





Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
...and, suddenly, the high-core-count Opterons show us what we wanted and
hoped for. If I increase that range statement to 100 and run it on the
48-core node, it takes 50 seconds (before it took 50 minutes), while the
FX-8350 takes 3:31.89 and the 3770K takes 3:48.95. Thanks Marshall! I think
you may have just saved us outrageous quantities of time, though Lee isn't
convinced that we know exactly what's going on with clojush yet, I don't
think. We haven't looked at it yet though I'm sure we will soon enough!







Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Wm. Josiah Erikson
Hm. Interesting. For the record, the exact code I'm running right now that
I'm seeing great parallelism with is this:

(defn reverse-recursively [coll]
  (loop [[r & more :as all] (seq coll)
         acc '()]
    (if all
      (recur more (cons r acc))
      acc)))

(defn burn
  ([] (loop [i 0
             value '()]
        (if (= i 1)
          (count (last (take 1 (iterate reverse-recursively value))))
          (recur (inc i)
                 (cons
                   (* (int i)
                      (+ (float i)
                         (- (int i)
                            (/ (float i)
                               (inc (int i))))))
                   value)))))
  ([_] (burn)))

(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

(defn -main
  [& args]
  (time (doall (pmapall burn (range 100))))
  (System/exit 0))



Re: abysmal multicore performance, especially on AMD processors

2012-12-11 Thread Lee Spector

On Dec 11, 2012, at 1:06 PM, Marshall Bockrath-Vandegrift wrote:
 So I think if you replace your calls to `reverse` and any `conj` loops
 you have in your own code, you should see a perfectly reasonable
 speedup.

Tantalizing, but on investigation I see that our real application actually does 
very little explicitly with reverse or conj, and I don't actually think that 
we're getting reasonable speedups (which is what led me to try that benchmark). 
So while I'm not sure of the source of the problem in our application I think 
there can be a problem even if one avoids direct calls to reverse and conj. 
Andy's recent tests also seem to confirm this.

BTW benchmarking our real application (https://github.com/lspector/Clojush) is 
a bit tricky because it's riddled with random number generator calls that can 
have big effects, but we're going to look into working around that. Recent 
postings re: seedable RNGs may help, although changing all of the RNG code may 
be a little involved because we use thread-local RNGs (to avoid contention and 
get good multicore speedups... we thought!).
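
(For reference, a common thread-local RNG arrangement looks roughly like
this sketch -- an assumption about the general approach, not Clojush's
actual code:)

(def ^ThreadLocal thread-rng
  (proxy [ThreadLocal] []
    (initialValue [] (java.util.Random.))))

(defn thread-rand-int [n]
  (.nextInt ^java.util.Random (.get thread-rng) n))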

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread cameron
Hi Marshall,
  I think we're definitely on the right track.
If I replace the reverse call with the following function, I get a parallel 
speedup of ~7.3 on an 8-core machine.

(defn copy-to-java-list [coll]
  (let [lst (java.util.LinkedList.)]
    (doseq [x coll]
      (.addFirst lst x))
    lst))

This function should do as much memory allocation as the Clojure reverse 
but has vastly better parallel performance.
There does seem to be something unusual about conj and 
clojure.lang.PersistentList in this parallel test case, and I don't think 
it's related to the JVM's memory allocation.

Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marko Topolnik
The main GC feature here is the Thread-Local Allocation Buffer (TLAB). 
TLABs are on by default and are automatically sized according to allocation 
patterns. The size can also be fine-tuned with the -XX:TLABSize=n configuration 
option. You may consider tweaking this setting to optimize 
runtime. Basically, everything that one call to your function allocates 
should fit into a TLAB, because it is all garbage upon exit. Allocation 
inside a TLAB is ultra-fast and completely concurrent.
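
(A sketch of how one might experiment with this from Leiningen; the size is
illustrative, not a recommendation:)

  :jvm-opts ["-XX:TLABSize=256k" "-XX:+PrintTLAB"]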

Configure TLAB:
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html

On Sunday, December 9, 2012 7:37:09 PM UTC+1, Andy Fingerhut wrote:


 On Dec 9, 2012, at 6:25 AM, Softaddicts wrote: 

  If the number of object allocation mentioned earlier in this thread are 
 real, 
  yes vm heap management can be a bottleneck. There has to be some 
  locking done somewhere otherwise the heap would corrupt :) 
  
  The other bottleneck can come from garbage collection which has to 
 freeze 
  object allocation completely or partially. 
  
  This internal process has to reclaim unreferenced objects, otherwise you 
 may end up 
  exhausting the heap. That can even suspend your app while gc is running, 
 depending 
  on the strategy used. 

 Agreed that memory allocation and garbage collection will in some cases 
 need to coordinate between threads to work in the general case of arbitrary 
 allocations and GCs. 

 However, one could have a central list of large pages of free memory 
 (e.g. a few MBytes, or maybe even larger), and pass these out to concurrent 
 memory allocators in these large chunks, and let them do small object 
 allocations and GC within each thread completely concurrently. 

 The only times locking of any kind might be needed with such a strategy 
 would be when one of the parallel threads requests a new big page from the 
 central free list, or returned a completely empty free page back to the 
 central list that it didn't need any more.  All other memory allocation and 
 GC could be completely concurrent.  The idea of making those pages large 
 is that such passing pages around would be infrequent, and thus could be 
 made to have no measurable synchronization overhead. 

 That is pretty much what is happening when you run Lee's benchmark 
 programs as 1 thread per JVM, but 4 or 8 different JVMs processes running 
 in parallel.  In that situation the OS has the central free list of pages, 
 and the JVMs manage their small object allocations and GCs completely 
 concurrently without interfering with each other. 

 If HotSpot's JVM could be configured to work like that, he would be seeing 
 big speedups in a single JVM. 

 Andy 
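
(A toy sketch of the page-passing design described above -- illustrative
only, and nothing like HotSpot's actual implementation: threads take large
pages from a central free list, and all cross-thread coordination is
confined to that hand-off:)

(def free-pages (java.util.concurrent.ConcurrentLinkedQueue.))

(defn take-page []
  ;; the only cross-thread coordination point
  (or (.poll free-pages)
      (byte-array (* 4 1024 1024))))  ; a fresh 4 MB page

(defn return-page [p]
  (.offer free-pages p))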




Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes:

 There does seem to be something unusual about conj and
 clojure.lang.PersistentList in this parallel test case and I don't
 think it's related to the JVMs memory allocation.

I’ve got a few more data-points, but still no handle on what exactly is
going on.

My last benchmark showing the `conj*` speedup for `Cons` objects
degrading as soon as it was used on a `PersistentList` was incomplete.
In fact, the speedup degrades after it is used on objects of more than
one type.  The effect just appears immediately when used with
`PersistentList` because '() is in fact a different type, a
`PersistentList$EmptyList`.  Using `conj*` first on a vector
implementation then results in the same inverse speedup on `Cons`s.

Even without your near-optimal speedup using Java standard library
types, I think your earlier benchmarks are enough to demonstrate that
this isn’t an issue with allocation alone.  All of the implementations
based on `reduce` with `conj` must allocate and return a new object for
each iteration.  If parallel allocation were the sole issue, I’d expect
all of the implementations to demonstrate the same behavior.

Unfortunately I have no idea what to connect from these facts:

  - Parallel allocation of `Cons` and `PersistentList` instances through
a Clojure `conj` function remains fast as long as the function only
ever returns objects of a single concrete type

  - Parallel allocation speed for `PersistentVector` instances is
unaffected by `conj` returning multiple types, and does not
demonstrate the inverse speedup seen for the previous types.

At this point I believe the symptoms point to cache contention, but I
don’t know where or why.  Using OpenJDK 7 with -XX:+UseCondMark didn’t
appear to produce any improvement.  Creating a private copy of
`PersistentList` which contained additional padding fields likewise
didn’t appear to produce any improvement.

So, Lee Spector: I think it’s possible to work around this, though, by
just not using `conj` on lists.  It’s suboptimal, but at least it solves
the problem in your original benchmark.  Further improvements are
obviously possible, but that’s a start.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread meteorfox


  - Parallel allocation of `Cons` and `PersistentList` instances through 
 a Clojure `conj` function remains fast as long as the function only 
 ever returns objects of a single concrete type 


A possible explanation for this could be JIT deoptimization. Deoptimization 
happens when the JIT has optimized a hot code path based on what it has 
learned from the executing program, but its assumptions no longer hold; the 
compiler notices it has made an incorrect decision in the previous 
optimization and performs the deoptimization.

It might be the case that after a deoptimization, the JIT compiler 
reoptimizes a code path that was previously optimized. This frequently 
happens when dealing with different implementations of an interface, and/or 
inheritance and type hierarchies, because the code path execution keeps 
changing and goes through multiple optimizations and deoptimizations.

To identify whether deoptimization is happening, you can add the following 
flag to your JVM args: -XX:+PrintCompilation 
If the output contains prints of "made not entrant", then that indicates a 
deoptimization, and that method will be interpreted until
a certain threshold of executions is surpassed and it gets optimized again.
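
(With Leiningen, a sketch of enabling it, assuming the flag belongs in the
project's :jvm-opts:)

  :jvm-opts ["-XX:+PrintCompilation"]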


References:
Robust Java Benchmarking: 
http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html#do
Java Performance, Charlie Hunt and Binu John





Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Aha. Not only do I get a lot of "made not entrant", I get a lot of "made
zombie". However, I get this for both runs with map and with pmap (and with
pmapall as well).

For instance, from a pmapall run:

  33752  159 clojure.lang.Cons::next (10 bytes)   made zombie
  33752  164 clojure.lang.RT::conj (21 bytes)   made zombie
  33753  154 clojure.lang.RT::seq (32 bytes)   made zombie
  33753    5 %   clojure.core$reduce1::invoke @ -2 (184 bytes)   made zombie
  33753  167 clojure.core$conj::invoke (13 bytes)   made zombie
  33753  184 clojure.core$rest::invoke (7 bytes)
  33884  186 clojure.lang.Numbers$LongOps::isPos (15 bytes)
  34298  187 clojure.lang.Numbers::dec (17 bytes)
  34421  188 clojure.lang.Numbers$LongOps::dec (13 bytes)
  34897  189 clojure.core$take::invoke (24 bytes)
  34903  190 clojure.core$take$fn__4227::<init> (15 bytes)
  35386  191 clojure.core$iterate::invoke (39 bytes)
SNIP

   1168  154 clojure.lang.RT::seq (32 bytes)   made not entrant
   1169  168 clojure.lang.RT::seq (32 bytes)
   1171    5 %   clojure.core$reduce1::invoke @ -2 (184 bytes)   made not entrant
   1171  169 clojure.core$reduce1::invoke (184 bytes)
   1173  159 clojure.lang.Cons::next (10 bytes)   made not entrant
   1173  167 clojure.core$conj::invoke (13 bytes)   made not entrant
   1173  164 clojure.lang.RT::conj (21 bytes)   made not entrant
   1192  170 clojure.lang.PersistentList::first (5 bytes)
   1193  171 clojure.lang.PersistentList::next (18 bytes)
   1193  172 clojure.lang.Cons::next (10 bytes)
   1194  173 clojure.lang.RT::conj (21 bytes)
   1197  174 clojure.core$conj::invoke (13 bytes)
   1233    6 %   clojure.core$reduce1::invoke @ 0 (184 bytes)

And then, from a map run:

   1163  151 clojure.lang.RT::seq (32 bytes)   made not entrant
   1163  145 clojure.core$cons::invoke (10 bytes)   made not entrant
   1163  168 clojure.lang.RT::seq (32 bytes)
   1165    4 %   clojure.core$reduce1::invoke @ 0 (184 bytes)
   1169  144 clojure.lang.RT::cons (46 bytes)   made not entrant
   3467  169 clojure.lang.RT::cons (46 bytes)
   3470   24 clojure.lang.Util::equiv (65 bytes)   made zombie
   3470   23 java.lang.String::equals (88 bytes)   made zombie
   3470   18 clojure.lang.PersistentArrayMap::createWithCheck (80 bytes)   made zombie
   3470   26 clojure.lang.PersistentArrayMap::equalKey (6 bytes)   made zombie
   3617  170 clojure.core$cons::invoke (10 bytes)
   3622   41 clojure.lang.PersistentArrayMap::indexOf (34 bytes)   made zombie
   3622   30   ! java.net.URL::<init> (543 bytes)   made zombie
   3622   58 clojure.lang.Symbol::equals (49 bytes)   made zombie
   3623   73 java.lang.Object::equals (11 bytes)   made zombie
   3623   65 clojure.lang.Util::hasheq (43 bytes)   made zombie
   4249  171  s! clojure.lang.LazySeq::sval (54 bytes)
   4259   77 clojure.lang.Util::equiv (65 bytes)   made zombie
   4259   89 clojure.lang.RT::first (35 bytes)   made zombie
   4259   88 clojure.lang.PersistentHashMap$NodeSeq::create (94 bytes)   made zombie
   4578  172 n   java.lang.Object::getClass (0 bytes)
   5380  173 clojure.lang.AFunction::<init> (5 bytes)
   5634  174 java.lang.Long::longValue (5 bytes)
   5785  175  s  clojure.lang.LazySeq::seq (53 bytes)
   5830  176 clojure.lang.Numbers::ops (97 bytes)
   6168  177 clojure.lang.LazySeq::<init> (10 bytes)
   6169   32 java.lang.AbstractStringBuilder::append (48 bytes)   made zombie
  10727  178 java.lang.Long::valueOf (36 bytes)
  10730   37 java.lang.StringBuilder::append (8 bytes)   made zombie
  10730   49   ! sun.misc.URLClassPath$JarLoader::getResource (91 bytes)   made zombie
  10730   44 sun.misc.URLClassPath::getResource (74 bytes)   made zombie
  11121  179 clojure.lang.Numbers::num (5 bytes)
  11240  180 clojure.core$rest::invoke (7 bytes)
  11240  181 clojure.lang.RT::more (37 bytes)
  11242  144 clojure.lang.RT::cons (46 bytes)   made zombie
  11242  145 clojure.core$cons::invoke (10 bytes)   made zombie
  11242  151 clojure.lang.RT::seq (32 bytes)   made zombie




On Mon, Dec 10, 2012 at 11:21 AM, meteorfox
ctorresk8guitar@gmail.com wrote:

 -XX:+PrintCompilation

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - 

Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
I tried some more performance tuning options in Java, just for kicks, and
didn't get any advantages from them: -server -XX:+TieredCompilation
-XX:ReservedCodeCacheSize=256m

Also, in case it's informative:

[josiah@compute-1-17 benchmark]$ grep entrant compilerOutputCompute-1-1.txt
| wc -l
173
[josiah@compute-1-17 benchmark]$ grep entrant
compilerOutputCompute-1-17.txt | wc -l
178
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-17.txt
| wc -l
163
[josiah@compute-1-17 benchmark]$ grep zombie compilerOutputCompute-1-1.txt
| wc -l
158
[josiah@compute-1-17 benchmark]$





Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Marshall Bockrath-Vandegrift
Wm. Josiah Erikson wmjos...@gmail.com writes:

 Aha. Not only do I get a lot of made not entrant, I get a lot of
 made zombie. However, I get this for both runs with map and with
 pmap (and with pmapall as well)

I’m not sure this is all that enlightening.  From what I can gather,
“made not entrant” just means that a JITed version proved to be invalid
in light of later code and new invocation paths won’t use the previous
version.  And “made zombie” just means all references to an old JIT’d
version have been lost, making it available to be GC’ed.

A copy of `conj` becomes “not entrant” after being used on both vectors
and lists, but the new version gets the same speed-up when used on
vectors as a copy which has only been used on vectors.  There’s
something else going on which is specifically affecting parallel calls
to the polymorphic version when applied to instances of
`PersistentList` and `Cons`.

-Marshall

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: abysmal multicore performance, especially on AMD processors

2012-12-10 Thread Wm. Josiah Erikson
Interesting. I tried the following:
:jvm-opts [-Xmx10g -Xms10g -XX:+AggressiveOpts -server
-XX:+TieredCompilation -XX:ReservedCodeCacheSize=256m -XX:TLABSize=1G
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParNewGC
-XX:+ResizeTLAB -XX:+UseTLAB]

I got a slight slowdown, and the GC details are as follows:

[josiah@compute-1-1 benchmark]$ /usr/bin/time -f %E lein run
0.852: [GC 0.852: [ParNew: 2796224K->6726K(3145728K), 0.0136610 secs]
2796224K->6726K(10136256K), 0.0136930 secs] [Times: user=0.05 sys=0.02,
real=0.01 secs]
0.871: [GC 0.871: [ParNew: 2802950K->8104K(3145728K), 0.0116360 secs]
2802950K->8104K(10136256K), 0.0116720 secs] [Times: user=0.06 sys=0.01,
real=0.01 secs]
0.890: [GC 0.890: [ParNew: 2804328K->11538K(3145728K), 0.0112460 secs]
2804328K->11538K(10136256K), 0.0112720 secs] [Times: user=0.08 sys=0.00,
real=0.01 secs]
0.904: [GC 0.904: [ParNew: 2807762K->11752K(3145728K), 0.0092300 secs]
2807762K->11752K(10136256K), 0.0092550 secs] [Times: user=0.06 sys=0.00,
real=0.01 secs]
0.915: [GC 0.915: [ParNew: 2807976K->10702K(3145728K), 0.0072210 secs]
2807976K->10702K(10136256K), 0.0072480 secs] [Times: user=0.06 sys=0.00,
real=0.01 secs]
0.969: [GC 0.969: [ParNew: 2806926K->12249K(3145728K), 0.0206880 secs]
2806926K->12249K(10136256K), 0.0207160 secs] [Times: user=0.13 sys=0.01,
real=0.02 secs]
21.099: [GC 21.099: [ParNew: 2808473K->14256K(3145728K), 0.0174230 secs]
2808473K->14256K(10136256K), 0.0174580 secs] [Times: user=0.12 sys=0.00,
real=0.02 secs]
46.533: [GC 46.533: [ParNew: 2810480K->10070K(3145728K), 0.0097840 secs]
2810480K->10070K(10136256K), 0.0098140 secs] [Times: user=0.08 sys=0.00,
real=0.01 secs]
74.988: [GC 74.988: [ParNew: 2806294K->11576K(3145728K), 0.0134020 secs]
2806294K->11576K(10136256K), 0.0134330 secs] [Times: user=0.08 sys=0.00,
real=0.02 secs]
105.143: [GC 105.143: [ParNew: 2807800K->12728K(3145728K), 0.0121870 secs]
2807800K->12728K(10136256K), 0.0122240 secs] [Times: user=0.08 sys=0.00,
real=0.02 secs]
136.170: [GC 136.170: [ParNew: 2808952K->13336K(3145728K), 0.0144400 secs]
2808952K->13336K(10136256K), 0.0144720 secs] [Times: user=0.09 sys=0.00,
real=0.01 secs]
167.703: [GC 167.703: [ParNew: 2809560K->14763K(3145728K), 0.0105520 secs]
2809560K->14763K(10136256K), 0.0105830 secs] [Times: user=0.07 sys=0.00,
real=0.01 secs]
199.593: [GC 199.593: [ParNew: 2810987K->11407K(3145728K), 0.0142030 secs]
2810987K->11407K(10136256K), 0.0142350 secs] [Times: user=0.08 sys=0.00,
real=0.01 secs]
231.894: [GC 231.894: [ParNew: 2807631K->15066K(3145728K), 0.0129290 secs]
2807631K->15066K(10136256K), 0.0129630 secs] [Times: user=0.10 sys=0.01,
real=0.01 secs]
264.239: [GC 264.239: [ParNew: 2811290K->9632K(3145728K), 0.0119130 secs]
2811290K->9632K(10136256K), 0.0119850 secs] [Times: user=0.08 sys=0.00,
real=0.01 secs]
Elapsed time: 291038.415325 msecs
Heap
 par new generation   total 3145728K, used 2700935K [0x00057ae0,
0x00065035, 0x00065035)
  eden space 2796224K,  96% used [0x00057ae0, 0x00061f239bb0,
0x0006258b)
  from space 349504K,   2% used [0x00063ae0, 0x00063b768340,
0x00065035)
  to   space 349504K,   0% used [0x0006258b, 0x0006258b,
0x00063ae0)
 tenured generation   total 6990528K, used 0K [0x00065035,
0x0007fae0, 0x0007fae0)
   the space 6990528K,   0% used [0x00065035, 0x00065035,
0x000650350200, 0x0007fae0)
 compacting perm gen  total 21248K, used 11049K [0x0007fae0,
0x0007fc2c, 0x0008)
   the space 21248K,  52% used [0x0007fae0, 0x0007fb8ca638,
0x0007fb8ca800, 0x0007fc2c)
No shared spaces configured.
4:53.06
[josiah@compute-1-1 benchmark]$




Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Jim - FooBar();

Hi Lee,


Would it be difficult to try the following version of 'pmap'? It doesn't 
use futures but executors instead so at least this could help narrow the 
problem down... If the problem is due to the high number of futures 
spawned by pmap then this should fix it...


(defn- with-thread-pool* [num-threads f]
  (let [pool (java.util.concurrent.Executors/newFixedThreadPool
               num-threads)]
    (try (f pool)
      (finally
        (when pool (.shutdown pool))))))

(defmacro with-thread-pool [[name num-threads] & body]
  `(with-thread-pool* ~num-threads (fn [~name] ~@body)))

(defn pmapp [f coll]
  (with-thread-pool [pool (.. Runtime getRuntime availableProcessors)]
    (doall
      (map #(.get %)
        (.invokeAll pool
          (map (partial partial f) coll))))))
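
Calling it on the benchmark from earlier in the thread would then look
like, e.g., (pmapp burn (range 8)).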

And btw, reducers are great but usually you need to reformulate your 
problem... r/map is not parallel. It is serial but with no intermediate 
allocation (unlike lazy-seqs). If you want to make it parallel you need 
to r/fold the result of r/map (the reducer), providing reducing & 
combining fns. However, it still doesn't make sense for it to be slower than 
map... If you've followed Rich's example with the pie-maker, consider this:


Both map & r/map will return immediately. The difference is that map 
will build a recipe for the first item whereas r/map will build a recipe 
for the entire coll. In other words, the lazy-pie-maker assistant 
provides 1 apple at a time (after asking for an apple) but the 
reducer-pie-maker assistant, as soon as you ask for the 1st apple, 
will do the entire bag (without intermediate 'asking'). The lazy recipe 
is recursive whereas the reducer-based one looks like a stream... It 
should still be faster, albeit not terribly faster. You need r/fold to 
see any parallelism...
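
For instance, a minimal sketch of that reformulation, reusing the burn
fn from this thread (the vec coercion matters because only foldable
collections such as vectors are reduced in parallel):

(require '[clojure.core.reducers :as r])

;; r/map builds one big recipe; r/fold executes it via fork/join,
;; reducing chunks in parallel and combining the partial results with +.
(defn parallel-burn-sum [coll]
  (r/fold + (r/map burn (vec coll))))

;; e.g. (parallel-burn-sum (range 8))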



Jim






On 09/12/12 05:19, Lee Spector wrote:

On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote:

Yep, reducers, don't use lazy seqs. But they return just sth. like transformed 
functions, that will be applied when building the collection. So you can use 
them like this:

 (into [] (r/map burn (doall (range 4))))

See 
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more info...


Thanks Marek. This does fix the too quick to be true issue, but alas, on my 
mac laptop (a 4 core intel i7):

57522.869 msecs for (time (into [] (r/map burn (doall (range 4)))))

58263.312 msecs for (time (doall (map burn (doall (range 4)))))

So while I'm not getting a terrible slowdown from using the reducers version of 
map, I'm also not getting any speedup over the single-thread map.

We should try this on our other architectures too, but it doesn't look 
promising.

  -Lee

--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438





Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
cameron cdor...@gmail.com writes:

 Interesting problem: the slowdown seems to be caused by the reverse
 call (actually the calls to conj with a list argument).

Excellent analysis, sir!  I think this points things in the right
direction.

 fast-reverse    : map-ms: 3.3, pmap-ms 0.7, speedup 4.97
 list-cons   : map-ms: 4.0, pmap-ms 0.7, speedup 6.13

The difference between these two I believe is just jitter.  Once `cons`
is called on either a list or a vector, the result is a
`clojure.lang.Cons` object.

 vec-conj    : map-ms: 4.0, pmap-ms 1.3, speedup 3.10

For the sub-goal of optimizing `reverse`, I get better times even than
for the `cons`-based implementation by using transient vectors:

list-cons : map-ms: 4.0, pmap-ms 0.6, speedup 6.42
tvec-conj : map-ms: 0.9, pmap-ms 0.2, speedup 4.10

(defn tvec-conj [coll]
  (persistent! (reduce conj! (transient []) coll)))

 list-conj   : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
 clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50 (this is
 equivalent to the original code)

I add the following:

cons-conj : map-ms: 3.3, pmap-ms 16.8, speedup 0.19

(defn cons-conj [coll]
  (reduce conj (clojure.lang.Cons. (first coll) nil) (rest coll)))

I think this is the key, but I don’t understand it.

The `cons` function just immediately creates a new `Cons` object.  The
`conj` function calls the `.cons` method of the collection, and the
`.cons` implementation `Cons` inherits from `ASeq` just creates a new
`Cons` object!
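
For reference, the two paths side by side (paraphrased from the
clojure.lang sources, not quoted verbatim):

;; (cons x coll) => new Cons(x, (seq coll))          ; one concrete target
;; (conj coll x) => coll.cons(x)                     ; virtual dispatch on coll's type
;;               => ASeq.cons => new Cons(x, this)   ; for Cons and PersistentList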

It’s like there’s a lock of some sort sneaking in on the `conj` path.
Any thoughts on what that could be?

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
If the numbers of object allocations mentioned earlier in this thread are real,
yes, vm heap management can be a bottleneck. There has to be some
locking done somewhere, otherwise the heap would corrupt :)

The other bottleneck can come from garbage collection, which has to freeze 
object allocation completely or partially.

This internal process has to reclaim unreferenced objects, otherwise you may end up 
exhausting the heap. That can even suspend your app while gc is running, depending
on the strategy used.

In java 6, the vm is supposed to adjust the gc strategy by looking at your app's
behavior. That may take some time to profile, however. If your app develops this heavy
memory requirement abruptly, the vm may not have enough data to adapt.

Have a look here, you will find how to set the gc behavior explicitly:

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

This is specific to java 6; java 5 is a different beast, as is java 7.

Maybe this can help, I emphasize,  *can help* not having looked deeply into
your problem.

Luc P.


 
 On Dec 8, 2012, at 10:19 PM, meteorfox wrote:
  
  Now if you run vmstat 1 while running your benchmark you'll notice that the 
  run queue will be most of the time at 8, meaning that 8 processes are 
  waiting for CPU, and this is due to memory accesses (in this case, since 
  this is not true for all applications).
  
  So, I feel your benchmark may be artificial and does not truly represent 
  your real application, make sure to profile your real application, and 
  optimize according to the bottlenecks. There are really useful tools out 
  there 
  for profiling, such as VisualVM, perf.
 
 Thanks Carlos.
 
 I don't actually think that the benchmark is particularly artificial. It's 
 very difficult to profile my actual application because it uses random 
 numbers all over the place and is highly and nonlinearly variable in lots of 
 ways. But I think that the benchmark I'm running really is pretty 
 representative.
 
 In any event, WHY would all of that waiting be happening? Logically, nothing 
 should have to be waiting for anything else. We know from the fact that we 
 get good speedups from multiple simultaneous JVM runs, each just running one 
 call to my burn function, that the hardware is capable of performing well 
 with multiple instances of this benchmark running concurrently. It MUST be 
 able to handle all of the memory allocation and everything else. It's just 
 when we try to launch them all in parallel from the same Clojure process, 
 using pmap OR agents OR reducers, that we fail to get the concurrency 
 speedups. Why should this be? And what can be done about it?
 
 I know nearly nothing about the internals of the JVM, but is there perhaps a 
 bottleneck on memory allocation because there's a single serial allocator? 
 Perhaps when we run multiple JVMs each has its own allocator so we don't have 
 the bottleneck? If all of this makes sense and is true, then perhaps (wishful 
 thinking) there's a JVM option like use parallel memory allocators that 
 will fix it?!
 
 Thanks,
 
  -Lee
  
 
--
Softaddicts <lprefonta...@softaddicts.ca> sent by ibisMail from my ipad!



Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut

Lee:

I don't yet know how to get good speedups with this workload in the Oracle 
JVM, although it might be possible with options I'm unaware of.

Azul's Zing JVM has a different memory allocator and GC implementation that 
might be better tuned for parallel workloads.  I haven't used it myself yet -- 
this is just from hearing about it in the past and a brief scan of their web 
site.  They have free trials available.  Maybe you could try that to see if it 
gives you better results out of the box, or with minor tweaking of parameters?

I don't know the cost, but like many companies they might have significant 
discounts for educational customers.

Andy




Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut

On Dec 9, 2012, at 4:48 AM, Marshall Bockrath-Vandegrift wrote:
 
 It’s like there’s a lock of some sort sneaking in on the `conj` path.
 Any thoughts on what that could be?

My current best guess is the JVM's memory allocator, not Clojure code.

Andy



Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Marshall Bockrath-Vandegrift
Andy Fingerhut andy.finger...@gmail.com writes:

 My current best guess is the JVM's memory allocator, not Clojure code.

I didn’t mean to imply the problem was in Clojure itself, but I don’t
believe the issue is in the memory allocator either.  I now believe the
problem is in a class of JIT optimization HotSpot is performing which
turns into a “pessimization” in the parallel case.

I have the following code, taking the structure from Cameron’s
benchmarks, but replacing the exact functions being tested:

https://gist.github.com/4246320

Note that `conj*` is simply a copy-paste of the `conj` implementation
from clojure/core.clj.  The benchmark runs reducing with that `conj*`
function once on `Cons`es, then on `PersistentList`s, then again on
`Cons`es.
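
(For reference, a sketch of what such a copy presumably looks like,
mirroring the clojure.core/conj source of that era rather than quoting
the gist:)

(defn conj*
  ([coll x] (clojure.lang.RT/conj coll x))
  ([coll x & xs]
   (if xs
     (recur (conj* coll x) (first xs) (next xs))
     (conj* coll x))))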

And here are the results:

cons-conj* : map-ms: 5.6, pmap-ms 1.1, speedup 5.08
list-conj* : map-ms: 10.1, pmap-ms 15.9, speedup 0.63
cons-conj* : map-ms: 10.0, pmap-ms 15.6, speedup 0.64

The function performs fine on `Cons` objects, but once applied to
`PersistentList` objects, `conj*` somehow becomes “tainted” and acquires
the inverse speedup factor.  When I run with -XX:+PrintCompilation I
don’t see anything particularly suspicious, but I’m not well-versed in
the art of interpreting HotSpot’s entrails.

My idea for next steps is to create a copy of the `PersistentList` class
and selectively modify it in an attempt to identify what aspect of it
causes the JVM to produce this behavior.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Andy Fingerhut


Agreed that memory allocation and garbage collection will in some cases need to 
coordinate between threads to work in the general case of arbitrary allocations 
and GCs.

However, one could have a central list of large pages of free memory (e.g. a 
few MBytes, or maybe even larger), and pass these out to concurrent memory 
allocators in these large chunks, and let them do small object allocations and 
GC within each thread completely concurrently.

The only times locking of any kind might be needed with such a strategy would 
be when one of the parallel threads requests a new big page from the central 
free list, or returned a completely empty free page back to the central list 
that it didn't need any more.  All other memory allocation and GC could be 
completely concurrent.  The idea of making those pages large is that such 
passing pages around would be infrequent, and thus could be made to have no 
measurable synchronization overhead.

That is pretty much what is happening when you run Lee's benchmark programs as 
1 thread per JVM, but 4 or 8 different JVMs processes running in parallel.  In 
that situation the OS has the central free list of pages, and the JVMs manage 
their small object allocations and GCs completely concurrently without 
interfering with each other.

If HotSpot's JVM could be configured to work like that, he would be seeing big 
speedups in a single JVM.
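
For what it's worth, HotSpot's thread-local allocation buffers (TLABs) are
meant to approximate exactly this scheme.  A sketch of real HotSpot flags
for making their behavior visible, though whether tuning them helps here is
an open question:

:jvm-opts ["-XX:+UseTLAB"     ; thread-local allocation, on by default
           "-XX:+ResizeTLAB"  ; let the VM resize per-thread buffers
           "-XX:+PrintTLAB"]  ; print per-thread allocation statistics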

Andy



Re: abysmal multicore performance, especially on AMD processors

2012-12-09 Thread Softaddicts
There's no magic here; everyone tuning their app hits this wall eventually,
tweaking the JVM memory options :)

Luc


 
 
--
Softaddicts <lprefonta...@softaddicts.ca> sent by ibisMail from my ipad!



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Jim - FooBar();
Even though this is very surprising (and sad) to hear, I'm afraid I've 
got different experiences... My reducer-based parallel minimax is about 
3x faster than the serial one on my 4-core AMD Phenom II, and a tiny bit 
faster on my girlfriend's intel i5 (2 physical cores + 2 virtual); I 
suspect that's due to the slightly larger L3 cache... So no complaints on 
my part from this...


Now, with regards to pmap... The only occasions I'm finding pmap useful 
are for coarse-grain concurrency... In other words I'm only using pmap 
where I'd use Executors. Usually, when mapping a very expensive fn 
(that needs no coordination) across a very large collection of roughly 
equally-sized elements. In these cases I'm getting proper linear speedup, 
and I don't see why I wouldn't. If you're familiar with text-mining (I 
think you are), consider annotating a collection of documents (500 - 
1000) based on some dictionaries and probabilistic taggers. This could 
take more than 15 seconds per document, so pmap does pay off as you would 
expect.


I've not done any experiments on any clusters with reducers but i 
wouldn't expect them to perform well when distributed across several 
nodes. Basically you 'd have horrible locality for something that 
requires coordination (like minimax). However, if you could run 
something like a genetic-algorithm across several nodes and minimax on 
each node then I'd expect good data locality thus tremendous speedup. 
This is exactly what I'm planning of doing for training my chess 
neural-net. Using 20-30 8core machines would be ideal for this use-case 
(20-30 individuals in the population - each lifetime (5 games) running 
on a separate machine that is entirely devoted to minimax)...


Anyway, like you, I am baffled by your experiences... You say that 
monitoring cpus shows that they are all busy throughout the entire task. 
If this is truly the case then I really don't understand! Otherwise, as 
you say, pmap may leave cores idle for a bit if some tasks take a lot 
longer than others... this is why I said equally-sized chunks 
before... make sure you monitor your cpus closely to verify that they are 
indeed busy... This has bitten me in the past... Of course, in your case 
(range 8) indeed contains 'equally-sized' elements, so I don't know 
what to think! In any case there might be something going on that we're 
not seeing...


Jim


On 08/12/12 01:25, Lee Spector wrote:

I've been running compute intensive (multi-day), highly parallelizable Clojure processes 
on high-core-count machines and blithely assuming that since I saw near maximal CPU 
utilization in top and the like that I was probably getting good speedups.

But a colleague recently did some tests and the results are really quite 
alarming.

On intel machines we're seeing speedups but much less than I expected -- about 
a 2x speedup going from 1 to 8 cores.

But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost 
twice as long on 8 cores as on 1.

I'm baffled, and unhappy that my runs are probably going slower on 48-core and 
64-core nodes than on single-core nodes.

It's possible that I'm just doing something wrong in the way that I dispatch 
the tasks, or that I've missed some Clojure or JVM setting... but right now I'm 
mystified and would really appreciate some help.

I'm aware that there's overhead for multicore distribution and that one can 
expect slowdowns if the computations that are being distributed are fast 
relative to the dispatch overhead, but this should not be the case here. We're 
distributing computations that take seconds or minutes, and not huge numbers of 
them (at least in our tests while trying to figure out what's going on).

I'm also aware that the test that produced the data I give below, insofar as it 
uses pmap to do the distribution, may leave cores idle for a bit if some tasks 
take a lot longer than others, because of the way that pmap allocates cores to 
threads. But that also shouldn't be a big issue here because for this test all 
of the threads are doing the exact same computation. And I also tried using an 
agent-based dispatch approach that shouldn't have the pmap thread allocation 
issue, and the results were about the same.

Note also that all of the computations in this test are purely functional and 
independent -- there shouldn't be any resource contention issues.

The test: I wrote a time-consuming function that just does a bunch of math and 
list manipulation (which is what takes a lot of time in my real applications):

(defn burn
  ([] (loop [i 0
             value '()]
        (if (>= i 1)
          (count (last (take 1 (iterate reverse value))))
          (recur (inc i)
                 (cons
                   (* (int i)
                      (+ (float i)
                         (- (int i)
                            (/ (float i)
                               (inc (int i))))))
                   value)))))
  ([_] (burn)))

Then I have a main 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marshall Bockrath-Vandegrift
Lee Spector lspec...@hampshire.edu writes:

 I'm also aware that the test that produced the data I give below,
 insofar as it uses pmap to do the distribution, may leave cores idle
 for a bit if some tasks take a lot longer than others, because of the
 way that pmap allocates cores to threads.

Although it doesn’t impact your benchmark, `pmap` may be further
adversely affecting the performance of your actual program.  There’s an
open bug regarding `pmap` and chunked seqs:

http://dev.clojure.org/jira/browse/CLJ-862

The impact is that `pmap` with chunked seq input will spawn futures for
its function applications in flights of 32, spawning as many flights as
necessary to reach or exceed #CPUS + 2.  On a 48-way system, it will
initially launch 64 futures, then spawn an additional 32 every time the
number of active unrealized futures drops below 50, leading to
significant contention for a CPU-bound application.
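
A quick way to see the over-spawning, assuming nothing else is using the
future pool (a sketch; counts will vary by machine and run):

;; each sleeping future occupies a pool thread, so the number of distinct
;; thread names approximates pmap's in-flight window on a chunked seq
(let [threads (atom #{})]
  (dorun (pmap (fn [_]
                 (swap! threads conj (.getName (Thread/currentThread)))
                 (Thread/sleep 100))
               (range 128)))
  (count @threads))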

I hope it can be made useful in a future version of Clojure, but right
now `pmap` is more of an attractive nuisance than anything else.

-Marshall



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector

On Dec 7, 2012, at 9:42 PM, Andy Fingerhut wrote:
 
 
 When you say we can run multiple instances of the test on the same machine, 
 do you mean that, for example, on an 8 core machine you run 8 different JVMs 
 in parallel, each doing a single-threaded 'map' in your Clojure code and not 
 a 'pmap'?
 
 And what kinds of speedups does that achieve?
 
 The results could help indicate whether the problem you are seeing is due to 
 the hardware/OS, or something about multiple threads within a single JVM.


Here are a couple more results, courtesy my colleague Josiah Erikson (who may 
chime in here at some point too):

On a 4 core machine, each test within a single lein run call:

2:31 for (doall (pmap burn (range 8)))
1:29 for (doall (map burn (range 8)))
1:48 for (doall (pmapall burn (range 8)))  [see my reply to Jim, immediately 
following this one, for pmapall]

Now with multiple lein run calls launched concurrently:

29 seconds to do 4 concurrent copies of (doall (map burn (range 2)))
33 seconds to do 8 concurrent copies of (burn 2)  [the argument to burn doesn't 
matter, but that's what Josiah ran]
38 seconds to do 8 concurrent copies of (burn (rand-int 8)) [again, the 
argument shouldn't matter].

48 seconds to do 2 concurrent copies of (doall (map burn (range 4)))
1:07 to do 2 concurrent copies of (doall (pmap burn (range 4)))

What do we learn from this, aside from the fact that things are way wacky?

Thanks for the help!!

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector

On Dec 8, 2012, at 9:36 AM, Marshall Bockrath-Vandegrift wrote:
 


(I said this would be in a reply to Jim, but now I see it makes more sense in a 
reply to this message of Marshall's.)

In order to avoid issues with pmap we've also run tests using pmapall, 
defined as follows:


(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))
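
(For what it's worth, send actions run on a fixed thread pool of size
2 + number-of-processors, so this dispatches every task up front and
avoids pmap's windowing.)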

The results are more or less the same (awful).

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Paul deGrandis
My experiences in the past are similar to the numbers that Jim is reporting.

I have recently been centering most of my crunching code around reducers.
Is it possible for you to cook up a small representative test using 
reducers+fork/join (and potentially primitives in the intermediate steps)?

Perhaps it can shed some light on contention or chunking issues (and 
eliminate some of the allocation concerns).

Paul


Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector
On Dec 8, 2012, at 1:28 PM, Paul deGrandis wrote:

 My experiences in the past are similar to the numbers that Jim is reporting.
 
 I have recently been centering most of my crunching code around reducers.
 Is it possible for you to cook up a small representative test using 
 reducers+fork/join (and potentially primitives in the intermediate steps)?
 
 Perhaps it can shed some light on contention or chunking issues (and 
 eliminate some of the allocation concerns).

Just tried, my first foray into reducers, but I must not be understanding 
something correctly:

  (time (r/map burn (doall (range 4))))

returns in less than a second on my macbook pro, whereas

  (time (doall (map burn (range 4))))

takes nearly a minute.

This feels like unforced laziness (although it's not quite that fast), but 
clojure.core.reducers/map involves no laziness, right?

I think my burn function (reproduced below for convenience) forces all of the 
laziness in there... 

So what am I doing wrong?
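
A minimal sketch of the behavior in question, assuming r/map only composes
a recipe and defers all work to the eventual reduce:

(require '[clojure.core.reducers :as r])

(def recipe (r/map burn (doall (range 4))))  ; returns immediately: nothing runs
(time (into [] recipe))                      ; the four burns actually execute here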

BTW the runs described in this message use Clojure 1.5.0-alpha3 -- the ones in 
previous messages in this thread used Clojure 1.3.0.

 -Lee

(defn burn
  ([] (loop [i 0
             value '()]
        (if (>= i 1)
          (count (last (take 1 (iterate reverse value))))
          (recur (inc i)
                 (cons
                   (* (int i)
                      (+ (float i)
                         (- (int i)
                            (/ (float i)
                               (inc (int i))))))
                   value)))))
  ([_] (burn)))



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Andy: The short answer is yes, and we saw huge speedups. My latest post, as 
well as Lee's, has details.

On Friday, December 7, 2012 9:42:03 PM UTC-5, Andy Fingerhut wrote:


 On Dec 7, 2012, at 5:25 PM, Lee Spector wrote: 

  
  Another strange observation is that we can run multiple instances of the 
 test on the same machine and (up to some limit, presumably) they don't seem 
 to slow each other down, even though just one instance of the test appears 
 to be maxing out all of the CPU according to top. I suppose that means 
 that top isn't telling me what I thought -- my colleague says it can mean 
 that something is blocked in some way with a full instruction queue. But 
 I'm not interested in running multiple instances. I have single 
 computations that involve multiple expensive but independent 
 subcomputations, and I want to farm those subcomputations out to multiple 
 cores -- and get speedups as a result. My subcomputations are so completely 
 independent that I think I should be able to get speedups approaching a 
 factor of n for n cores, but what I see is a factor of only about 2 on 
 intel machines, and a bizarre factor of about 1/2 on AMD machines. 

 Lee: 

 When you say we can run multiple instances of the test on the same 
 machine, do you mean that, for example, on an 8 core machine you run 8 
 different JVMs in parallel, each doing a single-threaded 'map' in your 
 Clojure code and not a 'pmap'? 

 And what kinds of speedups does that achieve? 

 The results could help indicate whether the problem you are seeing is due 
 to the hardware/OS, or something about multiple threads within a single 
 JVM. 

 Andy 




Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running 
things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 
945, and found some very strange results, which I shall post here, after I 
explain a little function that Lee wrote that is designed to get improved 
results over pmap. It looks like this:

(defn pmapall
  Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far.
  [f coll]
  (let [agents (map agent coll)]
(dorun (map #(send % f) agents))
(apply await agents)
(doall (map deref agents

Refer to Lee's first post for the benchmarking routine we're running.

I figured that, in order to figure out whether it was Java's multithreading 
that was the problem (as opposed to memory bandwidth, or the OS, or whatever), 
I'd compare (doall (pmapall burn (range 8))) to running 8 concurrent 
copies of (burn (rand-int 8)), or even just (burn 2), or 4 copies of 
(doall (map burn (range 2))), or whatever. Does this make sense? I THINK it 
does. If it doesn't, then that's cool - just let me know why and I'll feel 
less crazy, because I am finding my results rather confounding.

On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do 
(doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 
1:48 to do (doall (pmapall burn (range 8))).

So that's weird, because although we do see decreased slowdown from using 
pmapall, we still don't see a speedup compared to map. Watching processor 
utilization while these are going on shows that map is using one core, and 
both pmap and pmapall are using all four cores fully, as they should. So, 
maybe the OS or the hardware just can't deal with running that many copies 
of burn at once? Maybe there's a memory bottleneck?

Now here's the weird part: it takes around 29 seconds to do four concurrent 
copies of (doall (map burn (range 2))), and around 33 seconds to run 8 copies 
of (burn 2). Yes. Read that again. What? Watching top while this is going 
on shows what you would expect to see: when I run four concurrent copies, 
I've got four copies of Java using 100% of a core each, and when I run 
eight concurrent copies, I see eight copies of Java, all using around 50% 
of the processor each.

Also, by the way, it takes 48 seconds to run two concurrent copies of 
(doall (map burn (range 4))) and 1:07 to run two concurrent copies of 
(doall (pmap burn (range 4))).

What is going on here? Is Java's multithreading really THAT bad? This 
appears to me to prove that Java, or clojure, has something very seriously 
wrong with it, or has outrageous amounts of overhead when spawning a new 
thread. No?

all run with :jvm-opts [-Xmx1g -Xms1g -XX:+AggressiveOpts] and 
clojure 1.5.0-beta1
(I tried increasing the memory allowed for the pmap and pmapall runs, even 
to 8g, and it doesn't help at all)
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode):

on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP


Any thoughts or ideas?

There's more weirdness, too, in case anybody in interested. I'm getting 
results that vary strangely from other benchmarks that are available, and 
make no sense to me. Check this out (these are incomplete, because I 
decided to dig deeper with the above benchmarks, but you'll see, I think, 
why this is so confusing, if you know how fast these processors are 
supposed to be):

all run with :jvm-opts [-Xmx1g -Xms1g -XX:+AggressiveOpts] and 
clojure 1.5.0-beta1
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode):

Key:
1. (pmap burn (range 8)):
2. (map burn (range 8)):
3. 8 concurrent copies of (pmap burn (range 8)):
4. 8 concurrent copies of (map burn (range 8)):
5. (pmapall burn (range 8)):

4x AMD Opteron 6168:
1. 4:02.06
2. 2:20.29
3.
4.

AMD Phenom II X4 945:
1. 2:31.65
2. 1:29.90
3. 3:32.60
4. 3:08.97
5. 1:48.36

AMD Phenom II X6 1100T:
1. 2:03.71
2. 1:14.76
3. 2:20.14
4. 1:57.38
5. 2:14.43

AMD FX 8120:
1. 4:50.06
2. 1:25.04
3. 5:55.84
4. 2:46.94
5. 4:36.61

AMD FX 8350:
1. 3:42.35
2. 1:13.94
3. 3:00.46
4. 2:06.18
5. 3:56.95

Intel Core i7 3770K:
1. 0:44
2. 1:37.18
3. 2:29.41
4. 2:16.05
5. 0:44.42

2 x Intel Paxville DP Xeon:
1. 6:26.112
2. 3:20.149
3. 8:09.85
4. 7:06.52
5. 5:55.29




Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut
I haven't analyzed your results in detail, but here are some results I had on 
my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011.

When running multiple threads within a single JVM invocation, I never got a 
speedup of even 2.  The highest speedup I measured was 1.82, when I ran 
8 threads using -XX:+UseParallelGC.  I tried with -XX:+UseParNewGC but never 
got a speedup over 1.45 (with 4 threads in parallel; it was lower with 8 
threads).

When running multiple invocations of lein2 run in parallel as separate 
processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 with 
4 processes, and 5.34 with 8 processes (it went over 4 I think because of 2 
hyperthreads per each of the 4 cores).

This is a strong indication that the issue is some kind of interference between 
multiple threads in the same JVM, not the hardware, at least on my hardware and 
OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java 1.6.0_37).

My first guess would be that even with -XX:+UseParallelGC or -XX:+UseParNewGC, 
there is either some kind of interference with garbage collection, or perhaps 
there is even some kind of interference between them when allocating memory?  
Should JVM memory allocations be completely parallel with no synchronization 
when running multiple threads, or do memory allocations sometimes lock a shared 
data structure?

Andy


On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:

 Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running 
 things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945, 
 and found some very strange results, which I shall post here, after I explain 
 a little function that Lee wrote that is designed to get improved results 
 over pmap. It looks like this:
 
 (defn pmapall
   "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
   [f coll]
   (let [agents (map agent coll)]
     (dorun (map #(send % f) agents))
     (apply await agents)
     (doall (map deref agents))))
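 
 A minimal usage sketch (assuming burn as defined elsewhere in this 
 thread): pmapall is called exactly like pmap; each agent runs f on one 
 element via the send thread pool, await blocks until every agent has 
 finished, and the results come back in input order.
 
 ;; same shape as the benchmark calls below; returns (10000 10000 ... 10000)
 (time (doall (pmapall burn (range 8))))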
 
 Refer to Lee's first post for the benchmarking routine we're running.
 
 I figured that, in order to figure out if it was Java's multithreading that 
 was the problem (as opposed to memory bandwidth, or the OS, or  whatever), 
 I'd compare (doall (pmapall burn (range 8))) to running 8 concurrent copies 
 of (burn (rand-int 8)) or even just (burn 2) or 4 copies of (doall (map burn 
 (range 2))) or whatever. Does this make sense? I THINK it does. If it 
 doesn't, then that's cool - just let me know why and I'll feel less crazy, 
 because I am finding my results rather confounding.
 
 On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do (doall (pmap 
 burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 1:48 to do 
 (doall (pmapall burn (range 8))).
 
 So that's weird, because although we do see decreased slowdown from using 
 pmapall, we still don't see a speedup compared to map. Watching processor 
 utilization while these are going on shows that map is using one core, and 
 both pmap and pmapall are using all four cores fully, as they should. So, 
 maybe the OS or the hardware just can't deal with running that many copies of 
 burn at once? Maybe there's a memory bottleneck?
 
 Now here's the weird part: it takes around 29 seconds to do four concurrent 
 copies of (doall (map burn (range 2))), around 33 seconds to run 8 copies of 
 (burn 2). Yes. Read that again. What? Watching top while this is going on 
 shows what you would expect to see: When I run four concurrent copies, I've 
 got four copies of Java using 100% of a core each, and when I run eight 
 concurrent copies, I see eight copies of Java, all using around 50% of the 
 processor each.
 
 Also, by the way, it takes 48 seconds to run two concurrent copies of 
 (doall (map burn (range 4))) and 1:07 to run two concurrent copies of 
 (doall (pmap burn (range 4))). 
 
 What is going on here? Is Java's multithreading really THAT bad? This appears 
 to me to prove that Java, or clojure, has something very seriously wrong with 
 it, or has outrageous amounts of overhead when spawning a new thread. No?
 
 all run with :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and clojure 
 1.5.0-beta1
 (I tried increasing the memory allowed for the pmap and pmapall runs, even to 
 8g, and it doesn't help at all)
 Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
 Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode):
 
 on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP
 
 
 Any thoughts or ideas?
 
 There's more weirdness, too, in case anybody is interested. I'm getting 
 results that vary strangely from other benchmarks that are available, and 
 make no sense to me. Check this out (these are incomplete, because I decided 
 to dig deeper with the above benchmarks, but you'll see, I think, why this 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Andy Fingerhut

On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

 The test: I wrote a time-consuming function that just does a bunch of math 
 and list manipulation (which is what takes a lot of time in my real 
 applications):
 
 (defn burn 
   ([] (loop [i 0
              value '()]
         (if (= i 10000)
           (count (last (take 10000 (iterate reverse value))))
           (recur (inc i)
                  (cons 
                    (* (int i) 
                       (+ (float i) 
                          (- (int i) 
                             (/ (float i) 
                                (inc (int i))))))
                    value)))))
   ([_] (burn)))

Lee:

I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing a 
linked list 10,000 times, each time allocating 10,000 cons cells (or Clojure's 
equivalent of a cons cell)?  For a total of around 100,000,000 memory 
allocations of small objects per call to burn?
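
To make the arithmetic behind that estimate explicit (a sketch; the numbers 
come from the burn definition above):

;; the loop conses one cell per iteration: 10,000 iterations -> a
;; 10,000-element list.  (take 10000 (iterate reverse value)) then yields
;; the original list plus 9,999 reversed copies, and (last ...) forces
;; them all; each reversal allocates ~10,000 fresh cells:
(* 10000 10000) ;=> 100000000, i.e. the ~100,000,000 allocations above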

I know that this is just a bit of code for testing purposes, and not your real 
application, but do you really do that quantity of memory allocation versus 
arithmetic in your code (7*10,000 arithmetic operations for every 10,000*10,000 
memory allocations?)

I still believe there is an issue here with either the memory allocation or 
garbage collection implementations that we are seeing here that can hopefully 
be improved with some command line options to the JVM that I don't know yet, 
but I know that I have seen much bigger speedups in code that didn't do this 
much memory allocation versus non-allocating computation (e.g. the Computer 
Language Benchmarks Game Clojure programs).

Andy



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector

On Dec 8, 2012, at 3:42 PM, Andy Fingerhut wrote:
 
  I'm hoping you realize that (take 10000 (iterate reverse value)) is reversing 
  a linked list 10,000 times, each time allocating 10,000 cons cells (or 
  Clojure's equivalent of a cons cell)?  For a total of around 100,000,000 
  memory allocations of small objects per call to burn?
 
 I know that this is just a bit of code for testing purposes, and not your 
 real application, but do you really do that quantity of memory allocation 
 versus arithmetic in your code (7*10,000 arithmetic operations for every 
 10,000*10,000 memory allocations?)
 
 I still believe there is an issue here with either the memory allocation or 
 garbage collection implementations that we are seeing here that can hopefully 
 be improved with some command line options to the JVM that I don't know yet, 
 but I know that I have seen much bigger speedups in code that didn't do this 
 much memory allocation versus non-allocating computation (e.g. the Computer 
 Language Benchmarks Game Clojure programs).

The numbers were guesses, but my real applications do have something of this 
flavor. Mostly I'm doing genetic programming, which involves generating, 
running, mutating and recombining programs. Population sizes are often in the 
thousands, with program sizes up to thousands of instructions/literals. Often 
the programs include list manipulation instructions so there's lots of memory 
allocation involved in executing them (in addition to the allocation required 
to generate and manipulate them).

 -Lee



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Wm. Josiah Erikson
I'm glad somebody else can duplicate our findings! I get results similar to 
this on Intel hardware. On AMD hardware, the disparity is bigger, and 
multiple threads of a single JVM invocation on AMD hardware consistently 
gives me slowdowns as compared to a single thread. Also, your results are 
on MacOS and my results are on linux, which raises the question: Is this 
generally true of Java, or is it something about clojure?

On Saturday, December 8, 2012 3:31:25 PM UTC-5, Andy Fingerhut wrote:

 I haven't analyzed your results in detail, but here are some results I had 
 on my 2GHz 4-core Intel core i7 MacBook Pro vintage 2011.

 When running multiple threads within a single JVM invocation, I never got 
 a speedup of even 2.  The highest speedup I measured was 1.82, when 
 I ran 8 threads using -XX:+UseParallelGC.  I tried with -XX:+UseParNewGC 
 but never got a speedup over 1.45 (with 4 threads in parallel -- it was 
 lower with 8 threads).

 When running multiple invocations of lein2 run in parallel as separate 
 processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 
 with 4 processes, and 5.34 with 8 processes (it went over 4 I think because 
 of 2 hyperthreads per each of the 4 cores).

 This is a strong indication that the issue is some kind of interference 
 between multiple threads in the same JVM, not the hardware, at least on my 
 hardware and OS (OS was Mac OS X 10.6.8, JVM was Apple/Oracle Java 
 1.6.0_37).

 My first guess would be that even with -XX:+UseParallelGC or 
 -XX:+UseParNewGC, there is either some kind of interference with garbage 
 collection, or perhaps there is even some kind of interference between them 
 when allocating memory?  Should JVM memory allocations be completely 
 parallel with no synchronization when running multiple threads, or do 
 memory allocations sometimes lock a shared data structure?

 Andy


 On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:

 Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running 
 things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 
 945, and found some very strange results, which I shall post here, after I 
 explain a little function that Lee wrote that is designed to get improved 
 results over pmap. It looks like this:

 (defn pmapall
   "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
   [f coll]
   (let [agents (map agent coll)]
     (dorun (map #(send % f) agents))
     (apply await agents)
     (doall (map deref agents))))

 Refer to Lee's first post for the benchmarking routine we're running.

 I figured that, in order to figure out if it was Java's multithreading 
 that was the problem (as opposed to memory bandwidth, or the OS, or 
 whatever), I'd compare ( doall( pmapall burn (range 8))) to running 8 
 concurrent copies of (burn (rand-int 8) or even just (burn 2) or 4 copies 
 of ( doall( map burn (range 2))) or whatever. Does this make sense? I THINK 
 it does. If it doesn't, then that's cool - just let me know why and I'll 
 feel less crazy, because I am finding my results rather confounding.

 On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do (doall 
 (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and 1:48 
 to do (doall (pmapall burn (range 8))).

 So that's weird, because although we do see decreased slowdown from using 
 pmapall, we still don't see a speedup compared to map. Watching processor 
 utilization while these are going on shows that map is using one core, and 
 both pmap and pmapall are using all four cores fully, as they should. So, 
 maybe the OS or the hardware just can't deal with running that many copies 
 of burn at once? Maybe there's a memory bottleneck?

 Now here's the weird part: it takes around 29 seconds to do four 
 concurrent copies of (doall (map burn (range 2))), around 33 seconds to 
 run 8 copies of (burn 2). Yes. Read that again. What? Watching top while 
 this is going on shows what you would expect to see: When I run four 
 concurrent copies, I've got four copies of Java using 100% of a core each, 
 and when I run eight concurrent copies, I see eight copies of Java, all 
 using around 50% of the processor each.

 Also, by the way, it takes 48 seconds to run two concurrent copies of 
 (doall (map burn (range 4))) and 1:07 to run two concurrent copies of 
 (doall (pmap burn (range 4))). 

 What is going on here? Is Java's multithreading really THAT bad? This 
 appears to me to prove that Java, or clojure, has something very seriously 
 wrong with it, or has outrageous amounts of overhead when spawning a new 
 thread. No?

 all run with :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and 
 clojure 1.5.0-beta1
 (I tried increasing the memory allowed for the pmap and pmapall runs, even 
 to 8g, and it doesn't help at all)
 Java(TM) SE 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Marek Šrank


 Just tried, my first foray into reducers, but I must not be understanding 
 something correctly: 

   (time (r/map burn (doall (range 4))))

 returns in less than a second on my macbook pro, whereas 

   (time (doall (map burn (range 4))))

 takes nearly a minute. 

 This feels like unforced laziness (although it's not quite that fast), but 
 clojure.core.reducers/map involves no laziness, right? 


Yep, reducers don't use lazy seqs. But they return just something like 
transformed functions that will be applied when building the collection. 
So you can use them like this:

(into [] (r/map burn (doall (range 4))))

See 
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more 
info...
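
(One caveat, as a sketch rather than a claim about your measurements: (into 
[] ...) realizes the result with a plain serial reduce, so by itself it 
won't use extra cores. The parallel entry point in reducers is r/fold over 
a vector; the partition size of 1 below forces a fork/join split even on a 
tiny 8-element input.)

(require '[clojure.core.reducers :as r])
;; sums the 10000-element counts that burn returns, reducing in parallel
(time (r/fold 1 + + (r/map burn (vec (range 8)))))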


Marek.


Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee:

So I ran

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:


 I've been running compute intensive (multi-day), highly parallelizable 
 Clojure processes on high-core-count machines and blithely assuming that 
 since I saw near maximal CPU utilization in top and the like that I was 
 probably getting good speedups. 

 But a colleague recently did some tests and the results are really quite 
 alarming. 

 On intel machines we're seeing speedups but much less than I expected -- 
 about a 2x speedup going from 1 to 8 cores. 

 But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
 almost twice as long on 8 cores as on 1. 

 I'm baffled, and unhappy that my runs are probably going slower on 48-core 
 and 64-core nodes than on single-core nodes. 

 It's possible that I'm just doing something wrong in the way that I 
 dispatch the tasks, or that I've missed some Clojure or JVM setting... but 
 right now I'm mystified and would really appreciate some help. 

 I'm aware that there's overhead for multicore distribution and that one 
 can expect slowdowns if the computations that are being distributed are 
 fast relative to the dispatch overhead, but this should not be the case 
 here. We're distributing computations that take seconds or minutes, and not 
 huge numbers of them (at least in our tests while trying to figure out 
 what's going on). 

 I'm also aware that the test that produced the data I give below, insofar 
 as it uses pmap to do the distribution, may leave cores idle for a bit if 
 some tasks take a lot longer than others, because of the way that pmap 
 allocates cores to threads. But that also shouldn't be a big issue here 
 because for this test all of the threads are doing the exact same 
 computation. And I also tried using an agent-based dispatch approach that 
 shouldn't have the pmap thread allocation issue, and the results were about 
 the same. 

 Note also that all of the computations in this test are purely functional 
 and independent -- there shouldn't be any resource contention issues. 

 The test: I wrote a time-consuming function that just does a bunch of math 
 and list manipulation (which is what takes a lot of time in my real 
 applications): 

 (defn burn 
   ([] (loop [i 0 
              value '()] 
         (if (= i 10000) 
           (count (last (take 10000 (iterate reverse value)))) 
           (recur (inc i) 
                  (cons 
                    (* (int i) 
                       (+ (float i) 
                          (- (int i) 
                             (/ (float i) 
                                (inc (int i)))))) 
                    value))))) 
   ([_] (burn))) 

 Then I have a main function like this: 

 (defn -main 
   [& args] 
   (time (doall (pmap burn (range 8)))) 
   (System/exit 0)) 

 We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
 2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We 
 also tried Java 1.6.0_22. We've tried various JVM memory options (via 
 :jvm-opts with -Xmx and -Xms settings) and also with and without 
 -XX:+UseParallelGC. None of this seems to change the picture substantially. 

 The results that we get generally look like this: 

 - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code 
 above, it takes about 45 seconds (and all cores appear to be fully loaded 
 as it does so). If we change the pmap to just plain map, so that we use 
 only a single core, the time goes up to about 1 minute and 36 seconds. So 
 the speedup for 8 cores is just about 2x, even though there are 8 
 completely independent tasks. So that's pretty depressing. 

 - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, 
 running the same test (with pmap) takes about 4 minutes and 2 seconds. 
 That's really slow! Changing the pmap to map here produces a runtime of 
 about 2 minutes and 20 seconds. So it's quite a bit faster on one core than 
 on 8! And all of these times are terrible compared to those on the intel. 

 Another strange observation is that we can run multiple instances of the 
 test on the same machine and (up to some limit, presumably) they don't seem 
 to slow each other down, even though just one instance of the test appears 
 to be maxing out all of the CPU according to top. I suppose that means 
 that top isn't telling me what I thought -- my colleague says it can mean 
 that something is blocked in some way with a full instruction queue. But 
 I'm not interested in running multiple instances. I have single 
 computations that involve multiple expensive but independent 
 subcomputations, and I want to farm those subcomputations out to multiple 
 cores -- and get speedups as a result. My subcomputations are so completely 
 independent that I think I should be able to get speedups approaching a 
 factor of n for n cores, but what I see is a factor of only about 2 on 
 intel machines, and a bizarre factor of about 1/2 on AMD machines. 

 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Lee:

I ran Linux perf and also watched the run queue (with vmstat) and your 
bottleneck is basically memory access. The CPUs are idle 80% of the time by 
stalled cycles. Here's what I got on my machine.

Intel Core i7 4 cores with Hyper thread (8 virtual processors)
16 GiB of Memory
Oracle JVM

and this is how I ran perf. (Note: after monitoring the GC, even a 1 GiB heap 
is a lot of memory for your benchmark; it mostly used a lot less than 256 
MiB.)
Also I ran the (time (doall ..)) three times to make sure that the JVM was 
warmed up. :)

 perf stat java -server -Xmx1024m -Xms1024m -jar 
target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar
Elapsed time: 49194.258825 msecs
Warm-up 1 (10000 10000 10000 10000 10000 10000 10000 10000)
Elapsed time: 48200.221677 msecs
Warm-up 2 (10000 10000 10000 10000 10000 10000 10000 10000)
Elapsed time: 50050.013156 msecs
Run  (10000 10000 10000 10000 10000 10000 10000 10000)

 Performance counter stats for 'java -server -Xmx1024m -Xms1024m -jar 
target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar':

      1094967.286876 task-clock                #    7.383 CPUs utilized
             170,384 context-switches          #    0.000 M/sec
               2,932 CPU-migrations             #    0.000 M/sec
              24,734 page-faults                #    0.000 M/sec
   3,004,827,628,531 cycles                     #    2.744 GHz                     [83.34%]
   2,430,611,924,472 stalled-cycles-frontend    #   80.89% frontend cycles idle    [83.34%]
   1,976,836,501,358 stalled-cycles-backend     #   65.79% backend  cycles idle    [66.67%]
     650,868,504,974 instructions               #    0.22  insns per cycle
                                                #    3.73  stalled cycles per insn [83.33%]
     124,479,033,687 branches                   #  113.683 M/sec                   [83.33%]
       1,753,021,734 branch-misses              #    1.41% of all branches         [83.33%]

       148.307812402 seconds time elapsed

Now if you run vmstat 1 while running your benchmark you'll notice that the 
run queue will be most of the time at 8, meaning that 8 processes are 
waiting for CPU, and this is due to memory accesses (in this case, since 
this is not true for all applications).

So, I feel your benchmark may be artificial and does not truly represent 
your real application; make sure to profile your real application, and 
optimize according to the bottlenecks. There are really useful tools out 
there for profiling, such as VisualVM and perf.

Good Luck,

Carlos

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:


 I've been running compute intensive (multi-day), highly parallelizable 
 Clojure processes on high-core-count machines and blithely assuming that 
 since I saw near maximal CPU utilization in top and the like that I was 
 probably getting good speedups. 

 But a colleague recently did some tests and the results are really quite 
 alarming. 

 On intel machines we're seeing speedups but much less than I expected -- 
 about a 2x speedup going from 1 to 8 cores. 

 But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
 almost twice as long on 8 cores as on 1. 

 I'm baffled, and unhappy that my runs are probably going slower on 48-core 
 and 64-core nodes than on single-core nodes. 

 It's possible that I'm just doing something wrong in the way that I 
 dispatch the tasks, or that I've missed some Clojure or JVM setting... but 
 right now I'm mystified and would really appreciate some help. 

 I'm aware that there's overhead for multicore distribution and that one 
 can expect slowdowns if the computations that are being distributed are 
 fast relative to the dispatch overhead, but this should not be the case 
 here. We're distributing computations that take seconds or minutes, and not 
 huge numbers of them (at least in our tests while trying to figure out 
 what's going on). 

 I'm also aware that the test that produced the data I give below, insofar 
 as it uses pmap to do the distribution, may leave cores idle for a bit if 
 some tasks take a lot longer than others, because of the way that pmap 
 allocates cores to threads. But that also shouldn't be a big issue here 
 because for this test all of the threads are doing the exact same 
 computation. And I also tried using an agent-based dispatch approach that 
 shouldn't have the pmap thread allocation issue, and the results were about 
 the same. 

 Note also that all of the computations in this test are purely functional 
 and independent -- there shouldn't be any resource contention issues. 

 The test: I wrote a time-consuming function that just does a bunch of math 
 and list manipulation (which is what takes a lot of time in my real 
 applications): 

 (defn burn 
   ([] (loop [i 0 
              value '()] 
         (if (= i 10000) 
           (count (last (take 10000 (iterate reverse value)))) 
           (recur (inc i) 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread meteorfox
Correction regarding the run queue: that part was not completely correct. :S 
But the stalled cycles and memory accesses still hold.

Sorry for the misinformation.

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:


 I've been running compute intensive (multi-day), highly parallelizable 
 Clojure processes on high-core-count machines and blithely assuming that 
 since I saw near maximal CPU utilization in top and the like that I was 
 probably getting good speedups. 

 But a colleague recently did some tests and the results are really quite 
 alarming. 

 On intel machines we're seeing speedups but much less than I expected -- 
 about a 2x speedup going from 1 to 8 cores. 

 But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
 almost twice as long on 8 cores as on 1. 

 I'm baffled, and unhappy that my runs are probably going slower on 48-core 
 and 64-core nodes than on single-core nodes. 

 It's possible that I'm just doing something wrong in the way that I 
 dispatch the tasks, or that I've missed some Clojure or JVM setting... but 
 right now I'm mystified and would really appreciate some help. 

 I'm aware that there's overhead for multicore distribution and that one 
 can expect slowdowns if the computations that are being distributed are 
 fast relative to the dispatch overhead, but this should not be the case 
 here. We're distributing computations that take seconds or minutes, and not 
 huge numbers of them (at least in our tests while trying to figure out 
 what's going on). 

 I'm also aware that the test that produced the data I give below, insofar 
 as it uses pmap to do the distribution, may leave cores idle for a bit if 
 some tasks take a lot longer than others, because of the way that pmap 
 allocates cores to threads. But that also shouldn't be a big issue here 
 because for this test all of the threads are doing the exact same 
 computation. And I also tried using an agent-based dispatch approach that 
 shouldn't have the pmap thread allocation issue, and the results were about 
 the same. 

 Note also that all of the computations in this test are purely functional 
 and independent -- there shouldn't be any resource contention issues. 

 The test: I wrote a time-consuming function that just does a bunch of math 
 and list manipulation (which is what takes a lot of time in my real 
 applications): 

 (defn burn 
   ([] (loop [i 0 
              value '()] 
         (if (= i 10000) 
           (count (last (take 10000 (iterate reverse value)))) 
           (recur (inc i) 
                  (cons 
                    (* (int i) 
                       (+ (float i) 
                          (- (int i) 
                             (/ (float i) 
                                (inc (int i)))))) 
                    value))))) 
   ([_] (burn))) 

 Then I have a main function like this: 

 (defn -main 
   [& args] 
   (time (doall (pmap burn (range 8)))) 
   (System/exit 0)) 

 We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
 2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We 
 also tried Java 1.6.0_22. We've tried various JVM memory options (via 
 :jvm-opts with -Xmx and -Xms settings) and also with and without 
 -XX:+UseParallelGC. None of this seems to change the picture substantially. 

 The results that we get generally look like this: 

 - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code 
 above, it takes about 45 seconds (and all cores appear to be fully loaded 
 as it does so). If we change the pmap to just plain map, so that we use 
 only a single core, the time goes up to about 1 minute and 36 seconds. So 
 the speedup for 8 cores is just about 2x, even though there are 8 
 completely independent tasks. So that's pretty depressing. 

 - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, 
 running the same test (with pmap) takes about 4 minutes and 2 seconds. 
 That's really slow! Changing the pmap to map here produces a runtime of 
 about 2 minutes and 20 seconds. So it's quite a bit faster on one core than 
 on 8! And all of these times are terrible compared to those on the intel. 

 Another strange observation is that we can run multiple instances of the 
 test on the same machine and (up to some limit, presumably) they don't seem 
 to slow each other down, even though just one instance of the test appears 
 to be maxing out all of the CPU according to top. I suppose that means 
 that top isn't telling me what I thought -- my colleague says it can mean 
 that something is blocked in some way with a full instruction queue. But 
 I'm not interested in running multiple instances. I have single 
 computations that involve multiple expensive but independent 
 subcomputations, and I want to farm those subcomputations out to multiple 
 cores -- and get speedups as a result. My subcomputations are so completely 
 independent that I think I should be able to get speedups approaching 

Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector

On Dec 8, 2012, at 8:16 PM, Marek Šrank wrote:
 
  Yep, reducers don't use lazy seqs. But they return just something like 
  transformed functions that will be applied when building the collection. So 
  you can use them like this:
  
  (into [] (r/map burn (doall (range 4))))
 
 See 
 http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
 and http://clojure.com/blog/2012/05/15/anatomy-of-reducer.html for more 
 info...
 

Thanks Marek. This does fix the "too quick to be true" issue, but alas, on my 
mac laptop (a 4 core intel i7):

57522.869 msecs for (time (into [] (r/map burn (doall (range 4)))))

58263.312 msecs for (time (doall (map burn (doall (range 4)))))

So while I'm not getting a terrible slowdown from using the reducers version of 
map, I'm also not getting any speedup over the single-thread map.

We should try this on our other architectures too, but it doesn't look 
promising.

 -Lee

--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread Lee Spector

On Dec 8, 2012, at 10:19 PM, meteorfox wrote:
 
 Now if you run vmstat 1 while running your benchmark you'll notice that the 
 run queue will be most of the time at 8, meaning that 8 processes are 
 waiting for CPU, and this is due to memory accesses (in this case, since this 
 is not true for all applications).
 
  So, I feel your benchmark may be artificial and does not truly represent your 
  real application; make sure to profile your real application, and optimize 
  according to the bottlenecks. There are really useful tools out there 
  for profiling, such as VisualVM and perf.

Thanks Carlos.

I don't actually think that the benchmark is particularly artificial. It's very 
difficult to profile my actual application because it uses random numbers all 
over the place and is highly and nonlinearly variable in lots of ways. But I 
think that the benchmark I'm running really is pretty representative.

In any event, WHY would all of that waiting be happening? Logically, nothing 
should have to be waiting for anything else. We know from the fact that we get 
good speedups from multiple simultaneous JVM runs, each just running one call 
to my burn function, that the hardware is capable of performing well with 
multiple instances of this benchmark running concurrently. It MUST be able to 
handle all of the memory allocation and everything else. It's just when we try 
to launch them all in parallel from the same Clojure process, using pmap OR 
agents OR reducers, that we fail to get the concurrency speedups. Why should 
this be? And what can be done about it?

I know nearly nothing about the internals of the JVM, but is there perhaps a 
bottleneck on memory allocation because there's a single serial allocator? 
Perhaps when we run multiple JVMs each has its own allocator so we don't have 
the bottleneck? If all of this makes sense and is true, then perhaps (wishful 
thinking) there's a JVM option like "use parallel memory allocators" that will 
fix it?!
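
One way to probe this guess (a hypothetical experiment; I'm assuming 
HotSpot's thread-local allocation buffers, TLABs, which are supposed to be 
on by default and give each thread its own allocation region):

;; -XX:+UseTLAB is the HotSpot default; -XX:+PrintTLAB logs per-thread
;; allocation-buffer sizing at each GC, which should show whether the
;; threads really are allocating independently.
:jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+UseTLAB" "-XX:+PrintTLAB"]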

Thanks,

 -Lee
 



Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron

Interesting problem: the slowdown seems to be caused by the reverse call 
(actually the calls to conj with a list argument).
Calling conj in a multi-threaded environment seems to have a significant 
performance impact when using lists.
I created some alternate reverse implementations (the fastest uses a vector 
and cons), the gist with the test code can be found at 
https://gist.github.com/4243724
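
For reference, a sketch of the general shape (the gist has the real 
implementations; list-cons-reverse here is a reconstruction, not the gist 
code): a manual reverse that conses onto an accumulator list avoids calling 
conj on a list entirely.

(defn list-cons-reverse
  "Reverse coll by consing onto an accumulator list, bypassing conj."
  [coll]
  (loop [s (seq coll), acc '()]
    (if s
      (recur (next s) (cons (first s) acc))
      acc)))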

On a dual Xeon E5520 (8 physical cores) @ 2.27GHz with Linux running 
OpenJDK 1.7.0_09 I got the following results:

fast-reverse: map-ms: 3.3, pmap-ms 0.7, speedup 4.97
list-cons   : map-ms: 4.0, pmap-ms 0.7, speedup 6.13
vec-conj: map-ms: 4.0, pmap-ms 1.3, speedup 3.10
list-conj   : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50 (this is 
equivalent to the original code)

The following JDK command line options were used: 
  -Xmx20G -XX:MaxPermSize=8G -XX:+UseParallelGC 
-XX:+UseParallelOldGC 

Some other notes:
I ran the sample under YourKit and garbage collection represents a small 
percentage of execution time in both single and multi-threaded tests, there 
are no blocked threads for the duration of the test and there is no 
unexpected monitor/lock usage.

Cheers,
Cameron.


Re: abysmal multicore performance, especially on AMD processors

2012-12-08 Thread cameron
I forgot to mention, I cut the number of reverse iterations down to 1000 
(not 10000) so I wouldn't have to wait too long for criterium; the speedup 
numbers are representative of the full test though.

Cameron.


On Sunday, December 9, 2012 6:26:16 PM UTC+11, cameron wrote:


  Interesting problem: the slowdown seems to be caused by the reverse 
  call (actually the calls to conj with a list argument).
  Calling conj in a multi-threaded environment seems to have a significant 
  performance impact when using lists.
 I created some alternate reverse implementations (the fastest uses a 
 vector and cons), the gist with the test code can be found at 
 https://gist.github.com/4243724

  On a dual Xeon E5520 (8 physical cores) @ 2.27GHz with Linux running 
  OpenJDK 1.7.0_09 I got the following results:

 fast-reverse: map-ms: 3.3, pmap-ms 0.7, speedup 4.97
 list-cons   : map-ms: 4.0, pmap-ms 0.7, speedup 6.13
 vec-conj: map-ms: 4.0, pmap-ms 1.3, speedup 3.10
 list-conj   : map-ms: 10.8, pmap-ms 21.2, speedup 0.51
 clojure-reverse : map-ms: 13.5, pmap-ms 26.8, speedup 0.50 (this is 
 equivalent to the original code)

 The following JDK command line options were used: 
   -Xmx20G -XX:MaxPermSize=8G -XX:+UseParallelGC 
 -XX:+UseParallelOldGC 

 Some other notes:
 I ran the sample under YourKit and garbage collection represents a small 
 percentage of execution time in both single and multi-threaded tests, there 
 are no blocked threads for the duration of the test and there is no 
 unexpected monitor/lock usage.

 Cheers,
 Cameron.




abysmal multicore performance, especially on AMD processors

2012-12-07 Thread Lee Spector

I've been running compute intensive (multi-day), highly parallelizable Clojure 
processes on high-core-count machines and blithely assuming that since I saw 
near maximal CPU utilization in top and the like that I was probably getting 
good speedups. 

But a colleague recently did some tests and the results are really quite 
alarming. 

On intel machines we're seeing speedups but much less than I expected -- about 
a 2x speedup going from 1 to 8 cores.

But on AMD processors we're seeing SLOWDOWNS, with the same tests taking almost 
twice as long on 8 cores as on 1.

I'm baffled, and unhappy that my runs are probably going slower on 48-core and 
64-core nodes than on single-core nodes. 

It's possible that I'm just doing something wrong in the way that I dispatch 
the tasks, or that I've missed some Clojure or JVM setting... but right now I'm 
mystified and would really appreciate some help.

I'm aware that there's overhead for multicore distribution and that one can 
expect slowdowns if the computations that are being distributed are fast 
relative to the dispatch overhead, but this should not be the case here. We're 
distributing computations that take seconds or minutes, and not huge numbers of 
them (at least in our tests while trying to figure out what's going on).

I'm also aware that the test that produced the data I give below, insofar as it 
uses pmap to do the distribution, may leave cores idle for a bit if some tasks 
take a lot longer than others, because of the way that pmap allocates cores to 
threads. But that also shouldn't be a big issue here because for this test all 
of the threads are doing the exact same computation. And I also tried using an 
agent-based dispatch approach that shouldn't have the pmap thread allocation 
issue, and the results were about the same.

Note also that all of the computations in this test are purely functional and 
independent -- there shouldn't be any resource contention issues.

The test: I wrote a time-consuming function that just does a bunch of math and 
list manipulation (which is what takes a lot of time in my real applications):

(defn burn 
  ([] (loop [i 0
             value '()]
        (if (= i 10000)
          (count (last (take 10000 (iterate reverse value))))
          (recur (inc i)
                 (cons 
                   (* (int i) 
                      (+ (float i) 
                         (- (int i) 
                            (/ (float i) 
                               (inc (int i))))))
                   value)))))
  ([_] (burn)))

Then I have a main function like this:

(defn -main 
  [& args]
  (time (doall (pmap burn (range 8))))
  (System/exit 0))

We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We also 
tried Java 1.6.0_22. We've tried various JVM memory options (via :jvm-opts with 
-Xmx and -Xms settings) and also with and without -XX:+UseParallelGC. None of 
this seems to change the picture substantially.
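
For concreteness, the kind of project.clj settings we've been varying (a 
sketch, not our exact file; the project and namespace names are stand-ins):

(defproject cpu-burn "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.0-beta1"]]
  :main cpu-burn.core
  ;; variations we tried: -Xmx/-Xms from 1g up to 8g, with and without
  ;; -XX:+UseParallelGC
  :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+UseParallelGC"])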

The results that we get generally look like this:

- On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code 
above, it takes about 45 seconds (and all cores appear to be fully loaded as it 
does so). If we change the pmap to just plain map, so that we use only a single 
core, the time goes up to about 1 minute and 36 seconds. So the speedup for 8 
cores is just about 2x, even though there are 8 completely independent tasks. 
So that's pretty depressing.

- But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, running 
the same test (with pmap) takes about 4 minutes and 2 seconds. That's really 
slow! Changing the pmap to map here produces a runtime of about 2 minutes and 
20 seconds. So it's quite a bit faster on one core than on 8! And all of these 
times are terrible compared to those on the intel.

Another strange observation is that we can run multiple instances of the test 
on the same machine and (up to some limit, presumably) they don't seem to slow 
each other down, even though just one instance of the test appears to be maxing 
out all of the CPU according to top. I suppose that means that top isn't 
telling me what I thought -- my colleague says it can mean that something is 
blocked in some way with a full instruction queue. But I'm not interested in 
running multiple instances. I have single computations that involve multiple 
expensive but independent subcomputations, and I want to farm those 
subcomputations out to multiple cores -- and get speedups as a result. My 
subcomputations are so completely independent that I think I should be able to 
get speedups approaching a factor of n for n cores, but what I see is a factor 
of only about 2 on intel machines, and a bizarre factor of about 1/2 on AMD 
machines.

Any help would be greatly appreciated!

Thanks,

 -Lee

--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359

Re: abysmal multicore performance, especially on AMD processors

2012-12-07 Thread Andy Fingerhut
Lee:

I'll just give a brief description right now, but one thing I've found in the 
past on a 2-core machine that was achieving much less than 2x speedup was 
memory bandwidth being the limiting factor.

Not all Clojure code allocates memory, but a lot does.  If the hardware in a 
system can write at rate X from a multicore processor to main memory, and a 
single-threaded Clojure program writes to memory at rate 0.5*X, then the most 
speedup you will ever get out of multicore execution of the same code on N 
cores will be 2x, no matter how large N is.

As one way to see if this is the problem, you could try changing your burn 
function so that instead of doing cons to build up a list result, first 
allocate a Java mutable array before the loop that is as large as you need it 
to be at the end, and write values into that.  You can convert it to some other 
Clojure type at the end of the loop if you prefer.
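
Something like the following is what I have in mind (a sketch only -- 
burn-array is a hypothetical name, and this replaces just the list-building 
loop; the repeated-reverse part of burn would need its own array-based 
analogue):

(defn burn-array []
  (let [n 10000
        ^doubles a (double-array n)]   ; one up-front allocation
    (loop [i 0]
      (if (= i n)
        (alength a)                    ; mirrors burn's final count
        (do (aset a i
                  (double (* (int i)
                             (+ (float i)
                                (- (int i)
                                   (/ (float i)
                                      (inc (int i))))))))
            (recur (inc i)))))))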


I have some C benchmark programs that test memory read and write bandwidth on 
single and multiple cores you can run on your Intel machine to see if that 
might be the issue.  If this is the issue, I would expect to see at least a 
little speedup from 1 core to multiple cores, but capped at some maximum 
speedup that is determined by the memory bandwidth, not the number of cores you 
run in parallel.

I don't currently have any guess about what might be happening with the AMD 
multicore machine.  If you are interested in wild guessing, perhaps there could 
be some kind of multicore cache coherency protocol that is badly configured, 
causing cache lines to be frequently invalidated when multiple cores are 
sharing memory?  That would make more sense if multiple cores were reading from 
and writing to the same cache lines, which doesn't seem terribly likely for a 
typical Clojure program.

Let me know if you are interested and I will find those C programs for you to 
try out.  I got them from somewhere on the Internet and may have tweaked them a 
little bit.

Andy


On Dec 7, 2012, at 5:25 PM, Lee Spector wrote:

 
 I've been running compute intensive (multi-day), highly parallelizable 
 Clojure processes on high-core-count machines and blithely assuming that 
 since I saw near maximal CPU utilization in top and the like that I was 
 probably getting good speedups. 
 
 But a colleague recently did some tests and the results are really quite 
 alarming. 
 
 On intel machines we're seeing speedups but much less than I expected -- 
 about a 2x speedup going from 1 to 8 cores.
 
 But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
 almost twice as long on 8 cores as on 1.
 
 I'm baffled, and unhappy that my runs are probably going slower on 48-core 
 and 64-core nodes than on single-core nodes. 
 
 It's possible that I'm just doing something wrong in the way that I dispatch 
 the tasks, or that I've missed some Clojure or JVM setting... but right now 
 I'm mystified and would really appreciate some help.
 
 I'm aware that there's overhead for multicore distribution and that one can 
 expect slowdowns if the computations that are being distributed are fast 
 relative to the dispatch overhead, but this should not be the case here. 
 We're distributing computations that take seconds or minutes, and not huge 
 numbers of them (at least in our tests while trying to figure out what's 
 going on).
 
 I'm also aware that the test that produced the data I give below, insofar as 
 it uses pmap to do the distribution, may leave cores idle for a bit if some 
 tasks take a lot longer than others, because of the way that pmap allocates 
 cores to threads. But that also shouldn't be a big issue here because for 
 this test all of the threads are doing the exact same computation. And I also 
 tried using an agent-based dispatch approach that shouldn't have the pmap 
 thread allocation issue, and the results were about the same.
 
 Note also that all of the computations in this test are purely functional and 
 independent -- there shouldn't be any resource contention issues.
 
 The test: I wrote a time-consuming function that just does a bunch of math 
 and list manipulation (which is what takes a lot of time in my real 
 applications):
 
 (defn burn 
   ([] (loop [i 0
              value '()]
         (if (= i 10000)
           (count (last (take 10000 (iterate reverse value))))
           (recur (inc i)
                  (cons 
                    (* (int i) 
                       (+ (float i) 
                          (- (int i) 
                             (/ (float i) 
                                (inc (int i))))))
                    value)))))
   ([_] (burn)))
 
 Then I have a main function like this:
 
 (defn -main 
   [& args]
   (time (doall (pmap burn (range 8))))
   (System/exit 0))
 
  We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
  2.0.0-preview10) with Java 1.7.0_03 Java HotSpot(TM) 64-Bit Server VM. We 
 also tried Java 1.6.0_22. We've tried various JVM memory options (via 
 
