Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-20 Thread David Iba
Andy: Heh, glad to hear that I'm not the only one facing this issue, and I 
appreciate the encouragement since it's been kicking my ass the past week 
:)  On the bright side, coming from more of a math background, I've been 
forced to learn a lot about how cpus/threads/memory/etc. work!

Herwig: I just got a chance to look through that thread you linked - sounds 
very very similar to what I'm encountering!

Niels: Glad to hear you're able to replicate the behavior.  I was also 
using claypoole's unordered pmap but excluded it from my code examples 
for simplicity :)  One thing that's tricky about benchmarking with 
hyperthreading enabled: for fully CPU-bound jobs that don't share any 
cache and whatnot, if you're using all virtual cores (8 in your case), a 
2X slowdown would be expected.  Furthermore, if you launch fewer threads 
than the number of vCPUs available, it's possible that two of them get 
scheduled onto the same physical core and thus again run in ~2X the time.  
I noticed this seemed to happen more when the threads were spawned from 
the same java process (probably b/c it's presumed they can share cache) as 
opposed to separate processes.  So IMO the best way to test in this 
setting (without disabling HT) is to max out the vCPUs and compare against 
the expected 2X slowdown.
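
(Concretely: with 4 physical cores / 8 vCPUs and a CPU-bound task that 
takes T seconds on an idle core, 8 simultaneous runs are 8T of CPU work 
spread over 4 cores, so ~2T wall time per run is perfect scaling rather 
than a pathology.)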

I think the "multiple threads allocating simultaneously" hypothesis makes 
the most sense so far.  This TLAB setting is interesting and I'll 
definitely give adjusting that a try - is setting the JVM option 
"-XX:MinTLABSize=<size>" (like in the stackoverflow link Andy posted) the 
best way to go about this?
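
(Presumably something like the following - an untested guess on my part; 
MinTLABSize takes a size value rather than a +/- toggle, and from what I 
can tell ResizeTLAB and PrintTLAB are the related switches:)

    java -XX:MinTLABSize=1m -XX:-ResizeTLAB -XX:+PrintTLAB -jar bench.jar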

On Friday, November 20, 2015 at 5:53:42 PM UTC+9, Niels van Klaveren wrote:
>
> For what it's worth, here's the code I've been using while experimenting 
> along with this at home.
>
> Basically, it's a for loop over a collection of functions and a collection 
> of core counts, running a fixed number of tasks.
> So for every function it can step up from running f on one core n times 
> to f on x cores one time. I use com.climate/claypoole's unordered pmap, 
> which gives a nice abstraction over spawning futures.
>
> Included are two function sets: summation and key assoc (since the 
> cross-comparison used in the OP bugged me a bit).
> Suggestions for alterations are welcome, but the tests I ran seem to show 
> that all variants of the functions slow down considerably the more they 
> are run in parallel (2-3x overhead compared to a single-core run).
>
> Granted, I only could test this on a 4 core (8 hyperthreading) machine.
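>
> (A minimal sketch of such a harness - not the actual code, which isn't 
> shown above - assuming com.climate/claypoole and the f1/f2 from the 
> original post; bench is a made-up name:)
>
> (require '[com.climate.claypoole :as cp])
>
> (defn bench [fns core-counts n-tasks]
>   (doseq [[fname f] fns
>           cores     core-counts]
>     (cp/with-shutdown! [pool (cp/threadpool cores)]
>       (let [start (System/nanoTime)]
>         ;; upmap is claypoole's unordered pmap; dorun forces the lazy result
>         (dorun (cp/upmap pool (fn [_] (f)) (range n-tasks)))
>         (println fname "on" cores "cores:"
>                  (/ (- (System/nanoTime) start) 1e9) "sec")))))
>
> (bench {"summation" f1, "key-assoc" f2} [1 2 4] 18)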
>
> Thursday, November 19, 2015 at 9:58:47 PM UTC+1, Andy Fingerhut wrote:
>>
>> David:
>>
>> No new suggestions to add right now.  Herwig's suggestion that it could 
>> be the Java allocator has some evidence for it given your results.  I'm not 
>> sure whether this StackOverflow Q on TLAB is fully accurate, but it may 
>> provide some useful info:
>>
>>
>> http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab
>>
>> I mainly wanted to give you a virtual high-five, kudos, and thank-you 
>> thank-you thank-you thank-you thank-you for taking the time to run these 
>> experiments.  Similar performance issues with many threads in the same JVM 
>> on a many-core machine have come up before, and so far I don't know if 
>> anyone has gotten to the bottom of it yet.
>>
>> Andy
>>
>>
>> On Wed, Nov 18, 2015 at 10:36 PM, David Iba  wrote:
>>
>>> OK, have a few updates to report:
>>>
>>>    - Oracle vs OpenJDK did not make a difference
>>>    - Whenever I run N>1 threads calling any of these functions with 
>>>      swap/vswap, there is some overhead compared to running 18 separate 
>>>      single-run processes in parallel.  This overhead seems to increase 
>>>      as N increases.
>>>      - For both swap and vswap, the function timings from running 18 
>>>        futures (from one JVM) show about 1.5X the time from running 18 
>>>        separate JVM processes.
>>>      - For the swap version (f2), very often a few of the calls would 
>>>        go rogue and take around 3X the time of the others.
>>>        - This did not happen for the vswap version of f2.
>>>      - Running 9 processes with 2 f2-calling threads each was maybe 4% 
>>>        slower than 18 processes of 1.
>>>    - Running 4 processes with 4 f2-calling threads each was mostly the 
>>>      same speed as the 18x1, but there were a couple of those rogue 
>>>      threads that took 2-3X the time of the others.
>>>
>>> Any ideas?
>>>
>>> On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:

 No worries.  Thanks, I'll give that a try as well!

 On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>
> Oh, then I completely misunderstood the problem at hand here. If 
> that's the case then do the following:
>
> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that 
> changes anything. 
>

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Herwig Hochleitner
This reminds me of another thread, where performance issues related to
concurrent allocation were explored in depth:
https://groups.google.com/d/topic/clojure/48W2eff3caU/discussion
The main takeaway for me was that HotSpot will slow down pretty
dramatically as soon as there are two threads allocating.

Could you try:

a) how performance develops when you take out the allocation (assoc)
b) whether increasing HotSpot's TLAB size makes any difference?
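
(For (a), a minimal sketch - f2-no-alloc is a made-up name, patterned on 
the f2 from the original post; it keeps the swap! but drops the map 
allocation on each iteration, boxing of the counter aside:)

(defn f2-no-alloc []
  (let [x* (atom 0)]
    (loop [i 1e9]
      (when-not (zero? i)
        (swap! x* inc)   ; still boxes a Long, but no assoc'd map
        (recur (dec i))))))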



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Andy Fingerhut
David:

No new suggestions to add right now.  Herwig's suggestion that it could be
the Java allocator has some evidence for it given your results.  I'm not
sure whether this StackOverflow Q on TLAB is fully accurate, but it may
provide some useful info:

http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab

I mainly wanted to give you a virtual high-five, kudos, and thank-you
thank-you thank-you thank-you thank-you for taking the time to run these
experiments.  Similar performance issues with many threads in the same JVM
on a many-core machine have come up before, and so far I don't know if
anyone has gotten to the bottom of it yet.

Andy


On Wed, Nov 18, 2015 at 10:36 PM, David Iba  wrote:

> OK, have a few updates to report:
>
>    - Oracle vs OpenJDK did not make a difference
>    - Whenever I run N>1 threads calling any of these functions with
>      swap/vswap, there is some overhead compared to running 18 separate
>      single-run processes in parallel.  This overhead seems to increase
>      as N increases.
>      - For both swap and vswap, the function timings from running 18
>        futures (from one JVM) show about 1.5X the time from running 18
>        separate JVM processes.
>      - For the swap version (f2), very often a few of the calls would
>        go rogue and take around 3X the time of the others.
>        - This did not happen for the vswap version of f2.
>      - Running 9 processes with 2 f2-calling threads each was maybe 4%
>        slower than 18 processes of 1.
>    - Running 4 processes with 4 f2-calling threads each was mostly the
>      same speed as the 18x1, but there were a couple of those rogue
>      threads that took 2-3X the time of the others.
>
> Any ideas?
>
> On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>>
>> No worries.  Thanks, I'll give that a try as well!
>>
>> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>>
>>> Oh, then I completely misunderstood the problem at hand here. If that's
>>> the case then do the following:
>>>
>>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that
>>> changes anything.
>>>
>>> Timothy
>>>
>>>
>>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba  wrote:
>>>
 Timothy:  Each thread (call of f2) creates its own "local" atom, so I
 don't think there should be any swap retries.

 Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into
 trying Oracle and report back.

 Andy:  jvisualvm was showing pretty much all of the memory allocated in
 the eden space and a little in the first survivor (no major/full GC's), and
 total GC Time was very minimal.

 I'm in the middle of running some more tests and will report back when
 I get a chance today or tomorrow.  Thanks for all the feedback on this!

 On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>
> This sort of code is somewhat the worst case situation for atoms (or 
> really for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS 
> operation that most x86 CPUs have as an instruction. If we expand swap! it 
> looks something like this:
>
> (loop [old-val @x*]
>   (let [new-val (assoc old-val :k i)]
>     (if (compare-and-swap x* old-val new-val)
>       new-val
>       (recur @x*))))
>
> Compare-and-swap can be defined as "updates the content of the reference 
> to new-val only if the current value of the reference is equal to the 
> old-val".
>
> So in essence, only one core can be modifying the contents of an atom at 
> a time; if the atom is modified during the execution of the swap! call, 
> then swap! will continue to re-run your function until it's able to update 
> the atom without it being modified during the function's execution.
>
> So let's say you have some super long task that you need to integrate 
> into a ref; here's one way to do it, but probably not the best:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (swap! a long-operation-on-score some-param))))
>
> In this case long-operation-on-score will need to be re-run every time 
> a thread modifies the atom. However if our function only needs the state of 
> the ref to add to it, then we can do something like this instead:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (let [score (long-operation-on-score some-param)]
>         (swap! a + score)))))
>
> Now we only have a simple addition inside the swap! and we will have 
> less contention between the CPUs because they will most likely be spending 
> more time inside 'long-operation-on-score' instead of inside the swap.
>
> *TL;DR*: do as little work as possible inside swap!; the more you have 
> inside swap!, the higher chance you will have of throwing away work due to 
> swap! retries.

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread Fluid Dynamics
On Thursday, November 19, 2015 at 1:36:59 AM UTC-5, David Iba wrote:
>
> OK, have a few updates to report:
>
>    - Oracle vs OpenJDK did not make a difference
>    - Whenever I run N>1 threads calling any of these functions with 
>      swap/vswap, there is some overhead compared to running 18 separate 
>      single-run processes in parallel.  This overhead seems to increase 
>      as N increases.
>      - For both swap and vswap, the function timings from running 18 
>        futures (from one JVM) show about 1.5X the time from running 18 
>        separate JVM processes.
>      - For the swap version (f2), very often a few of the calls would 
>        go rogue and take around 3X the time of the others.
>        - This did not happen for the vswap version of f2.
>      - Running 9 processes with 2 f2-calling threads each was maybe 4% 
>        slower than 18 processes of 1.
>    - Running 4 processes with 4 f2-calling threads each was mostly the 
>      same speed as the 18x1, but there were a couple of those rogue 
>      threads that took 2-3X the time of the others.
>
> Any ideas?
>

Try a one-element array and aset, and see if that's faster than atom/swap 
and volatile/vswap. The latter two have memory barriers, the former does 
not; so if flushing the CPU cache is the key here, aset should be faster, 
but if it's something else, it will probably be the same speed. 
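
(A sketch of that variant, patterned on the f2 from the original post - 
f2-aset is a made-up name; aset on a plain object array is an ordinary 
store with no memory barrier:)

(defn f2-aset []
  (let [a (object-array [{}])]
    (loop [i 1e9]
      (when-not (zero? i)
        (aset a 0 (assoc (aget a 0) :k i))
        (recur (dec i))))))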



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-19 Thread David Iba
Yeah, I actually tried using aset as well, and was still seeing these
"rogue" threads taking much longer (although the ones that did finish in a
normal amount of time had very similar completion times to those running in
their own process).

Herwig: I will try those suggestions when I get a chance.



On Thu, Nov 19, 2015 at 6:19 PM, Fluid Dynamics  wrote:

> On Thursday, November 19, 2015 at 1:36:59 AM UTC-5, David Iba wrote:
>>
>> OK, have a few updates to report:
>>
>>    - Oracle vs OpenJDK did not make a difference
>>    - Whenever I run N>1 threads calling any of these functions with
>>      swap/vswap, there is some overhead compared to running 18 separate
>>      single-run processes in parallel.  This overhead seems to increase
>>      as N increases.
>>      - For both swap and vswap, the function timings from running 18
>>        futures (from one JVM) show about 1.5X the time from running 18
>>        separate JVM processes.
>>      - For the swap version (f2), very often a few of the calls would
>>        go rogue and take around 3X the time of the others.
>>        - This did not happen for the vswap version of f2.
>>      - Running 9 processes with 2 f2-calling threads each was maybe 4%
>>        slower than 18 processes of 1.
>>    - Running 4 processes with 4 f2-calling threads each was mostly the
>>      same speed as the 18x1, but there were a couple of those rogue
>>      threads that took 2-3X the time of the others.
>>
>> Any ideas?
>>
>
> Try a one-element array and aset, and see if that's faster than atom/swap
> and volatile/vswap. The latter two have memory barriers, the former does
> not; so if flushing the CPU cache is the key here, aset should be faster,
> but if it's something else, it will probably be the same speed.
>



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
Timothy:  Each thread (call of f2) creates its own "local" atom, so I don't 
think there should be any swap retries.

Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into trying 
Oracle and report back.

Andy:  jvisualvm was showing pretty much all of the memory allocated in the 
eden space and a little in the first survivor (no major/full GC's), and 
total GC Time was very minimal.

I'm in the middle of running some more tests and will report back when I 
get a chance today or tomorrow.  Thanks for all the feedback on this!

On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>
> This sort of code is somewhat the worst case situation for atoms (or 
> really for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS 
> operation that most x86 CPUs have as an instruction. If we expand swap! it 
> looks something like this:
>
> (loop [old-val @x*]
>   (let [new-val (assoc old-val :k i)]
>     (if (compare-and-swap x* old-val new-val)
>       new-val
>       (recur @x*))))
>
> Compare-and-swap can be defined as "updates the content of the reference 
> to new-val only if the current value of the reference is equal to the 
> old-val".
>
> So in essence, only one core can be modifying the contents of an atom at 
> a time; if the atom is modified during the execution of the swap! call, 
> then swap! will continue to re-run your function until it's able to update 
> the atom without it being modified during the function's execution.
>
> So let's say you have some super long task that you need to integrate 
> into a ref; here's one way to do it, but probably not the best:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (swap! a long-operation-on-score some-param))))
>
> In this case long-operation-on-score will need to be re-run every time 
> a thread modifies the atom. However if our function only needs the state of 
> the ref to add to it, then we can do something like this instead:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (let [score (long-operation-on-score some-param)]
>         (swap! a + score)))))
>
> Now we only have a simple addition inside the swap! and we will have 
> less contention between the CPUs because they will most likely be spending 
> more time inside 'long-operation-on-score' instead of inside the swap.
>
> *TL;DR*: do as little work as possible inside swap!; the more you have 
> inside swap!, the higher chance you will have of throwing away work due to 
> swap! retries.
>
> Timothy
>
> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta  wrote:
>
>> by the way, have you tried both Oracle and Open JDK with the same results?
>> Gianluca
>>
>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>>>
>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be 
>>> GC-related".
>>>
>>> What is jvisualvm showing you related to GC and/or memory allocation 
>>> when you tried the 18-core version with 18 threads in the same process?
>>>
>>> Even memory allocation could become a point of contention, depending 
>>> upon how the memory allocation works with many threads.  e.g. Depends on 
>>> whether a thread gets a large chunk of memory on a global lock, and then 
>>> locally carves it up into the small pieces it needs for each individual 
>>> Java 'new' allocation, or gets a global lock for every 'new'.  The latter 
>>> would give terrible performance as # cores increase, but I don't know how 
>>> to tell whether that is the case, except by knowing more about how the 
>>> memory allocator is implemented in your JVM.  Maybe digging through OpenJDK 
>>> source code in the right place would tell?
>>>
>>> Andy
>>>
>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:
>>>
 correction: that "do" should be a "doall".  (My actual test code was a 
 bit different, but each run printed some info when it started so it 
 doesn't 
 have to do with delayed evaluation of lazy seq's or anything).


 On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>
> Andy:  Interesting.  Thanks for educating me on the fact that atom 
> swap's don't use the STM.  Your theory seems plausible... I will try 
> those 
> tests next time I launch the 18-core instance, but yeah, not sure how 
> illuminating the results will be.
>
> Niels: along the lines of this (so that each thread prints its time as 
> well as printing the overall time):
>
>    (time
>     (let [f f1
>           n-runs 18
>           futs (do (for [i (range n-runs)]
>                      (future (time (f)))))]
>       (doseq [fut futs]
>         @fut)))
>
>
> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren 
> wrote:
>>
>> Could you also show how you are running these functions in parallel 
>> and time them? The way you start the functions can have as much impact 
>> as the functions themselves.

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
No worries.  Thanks, I'll give that a try as well!

On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>
> Oh, then I completely misunderstood the problem at hand here. If that's 
> the case then do the following:
>
> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes 
> anything. 
>
> Timothy
>
>
> On Wed, Nov 18, 2015 at 9:00 AM, David Iba  wrote:
>
>> Timothy:  Each thread (call of f2) creates its own "local" atom, so I 
>> don't think there should be any swap retries.
>>
>> Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into 
>> trying Oracle and report back.
>>
>> Andy:  jvisualvm was showing pretty much all of the memory allocated in 
>> the eden space and a little in the first survivor (no major/full GC's), and 
>> total GC Time was very minimal.
>>
>> I'm in the middle of running some more tests and will report back when I 
>> get a chance today or tomorrow.  Thanks for all the feedback on this!
>>
>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>
>>> This sort of code is somewhat the worst case situation for atoms (or 
>>> really for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS 
>>> operation that most x86 CPUs have as an instruction. If we expand swap! it 
>>> looks something like this:
>>>
>>> (loop [old-val @x*]
>>>   (let [new-val (assoc old-val :k i)]
>>>     (if (compare-and-swap x* old-val new-val)
>>>       new-val
>>>       (recur @x*))))
>>>
>>> Compare-and-swap can be defined as "updates the content of the reference 
>>> to new-val only if the current value of the reference is equal to the 
>>> old-val".
>>>
>>> So in essence, only one core can be modifying the contents of an atom at 
>>> a time; if the atom is modified during the execution of the swap! call, 
>>> then swap! will continue to re-run your function until it's able to update 
>>> the atom without it being modified during the function's execution.
>>>
>>> So let's say you have some super long task that you need to integrate 
>>> into a ref; here's one way to do it, but probably not the best:
>>>
>>> (let [a (atom 0)]
>>>   (dotimes [x 18]
>>>     (future
>>>       (swap! a long-operation-on-score some-param))))
>>>
>>> In this case long-operation-on-score will need to be re-run every time 
>>> a thread modifies the atom. However if our function only needs the state of 
>>> the ref to add to it, then we can do something like this instead:
>>>
>>> (let [a (atom 0)]
>>>   (dotimes [x 18]
>>>     (future
>>>       (let [score (long-operation-on-score some-param)]
>>>         (swap! a + score)))))
>>>
>>> Now we only have a simple addition inside the swap! and we will have 
>>> less contention between the CPUs because they will most likely be spending 
>>> more time inside 'long-operation-on-score' instead of inside the swap.
>>>
>>> *TL;DR*: do as little work as possible inside swap!; the more you have 
>>> inside swap!, the higher chance you will have of throwing away work due to 
>>> swap! retries.
>>>
>>> Timothy
>>>
>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta  
>>> wrote:
>>>
 by the way, have you tried both Oracle and Open JDK with the same 
 results?
 Gianluca

 On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>
> David, you say "Based on jvisualvm monitoring, doesn't seem to be 
> GC-related".
>
> What is jvisualvm showing you related to GC and/or memory allocation 
> when you tried the 18-core version with 18 threads in the same process?
>
> Even memory allocation could become a point of contention, depending 
> upon how the memory allocation works with many threads.  e.g. Depends on 
> whether a thread gets a large chunk of memory on a global lock, and then 
> locally carves it up into the small pieces it needs for each individual 
> Java 'new' allocation, or gets a global lock for every 'new'.  The latter 
> would give terrible performance as # cores increase, but I don't know how 
> to tell whether that is the case, except by knowing more about how the 
> memory allocator is implemented in your JVM.  Maybe digging through 
> OpenJDK 
> source code in the right place would tell?
>
> Andy
>
> On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:
>
>> correction: that "do" should be a "doall".  (My actual test code was 
>> a bit different, but each run printed some info when it started so it 
>> doesn't have to do with delayed evaluation of lazy seq's or anything).
>>
>>
>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>
>>> Andy:  Interesting.  Thanks for educating me on the fact that atom 
>>> swap's don't use the STM.  Your theory seems plausible... I will try 
>>> those 
>>> tests next time I launch the 18-core instance, but yeah, not sure how 

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread gianluca torta
by the way, have you tried both Oracle and Open JDK with the same results?
Gianluca

On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>
> David, you say "Based on jvisualvm monitoring, doesn't seem to be 
> GC-related".
>
> What is jvisualvm showing you related to GC and/or memory allocation when 
> you tried the 18-core version with 18 threads in the same process?
>
> Even memory allocation could become a point of contention, depending upon 
> how the memory allocation works with many threads.  e.g. Depends on whether 
> a thread gets a large chunk of memory on a global lock, and then locally 
> carves it up into the small pieces it needs for each individual Java 'new' 
> allocation, or gets a global lock for every 'new'.  The latter would give 
> terrible performance as # cores increase, but I don't know how to tell 
> whether that is the case, except by knowing more about how the memory 
> allocator is implemented in your JVM.  Maybe digging through OpenJDK source 
> code in the right place would tell?
>
> Andy
>
> On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:
>
>> correction: that "do" should be a "doall".  (My actual test code was a 
>> bit different, but each run printed some info when it started so it doesn't 
>> have to do with delayed evaluation of lazy seq's or anything).
>>
>>
>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>
>>> Andy:  Interesting.  Thanks for educating me on the fact that atom 
>>> swap's don't use the STM.  Your theory seems plausible... I will try those 
>>> tests next time I launch the 18-core instance, but yeah, not sure how 
>>> illuminating the results will be.
>>>
>>> Niels: along the lines of this (so that each thread prints its time as 
>>> well as printing the overall time):
>>>
>>>    (time
>>>     (let [f f1
>>>           n-runs 18
>>>           futs (do (for [i (range n-runs)]
>>>                      (future (time (f)))))]
>>>       (doseq [fut futs]
>>>         @fut)))
>>>
>>>
>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren 
>>> wrote:

 Could you also show how you are running these functions in parallel and 
 time them? The way you start the functions can have as much impact as the 
 functions themselves.

 Regards,
 Niels

 On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>
> I have functions f1 and f2 below, and let's say they run in T1 and T2 
> amount of time when running a single instance/thread.  The issue I'm 
> facing 
> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and 
> for more complex funcs takes absurdly long.
>
>
>    (defn f1 []
>      (apply + (range 2e9)))
>
>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>    ;; should never retry.
>    (defn f2 []
>      (let [x* (atom {})]
>        (loop [i 1e9]
>          (when-not (zero? i)
>            (swap! x* assoc :k i)
>            (recur (dec i))))))
>
> Of note:
> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and 
> T2 for 4 runs in parallel)
> - running 18 f1's in parallel on the 18-core machine also parallelizes 
> well.
> - Disabling hyperthreading doesn't help.
> - Based on jvisualvm monitoring, doesn't seem to be GC-related
> - also tried on dedicated 18-core ec2 instance with same issues, so 
> not shared-tenancy-related
> - if I make a jar that runs a single f2 and launch 18 in parallel, it 
> parallelizes well (so I don't think it's machine/aws-related)
>
> Could it be that the 18 f2's in parallel on a single JVM instance is 
> overworking the STM with all the swap's?  Any other theories?
>
> Thanks!
>
>
>


Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread Timothy Baldridge
This sort of code is somewhat the worst case situation for atoms (or really
for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS
operation that most x86 CPUs have as an instruction. If we expand swap! it
looks something like this:

(loop [old-val @x*]
  (let [new-val (assoc old-val :k i)]
    (if (compare-and-swap x* old-val new-val)
      new-val
      (recur @x*))))

Compare-and-swap can be defined as "updates the content of the reference to
new-val only if the current value of the reference is equal to the
old-val".

So in essence, only one core can be modifying the contents of an atom at a
time; if the atom is modified during the execution of the swap! call, then
swap! will continue to re-run your function until it's able to update the
atom without it being modified during the function's execution.

So let's say you have some super long task that you need to integrate into
a ref; here's one way to do it, but probably not the best:

(let [a (atom 0)]
  (dotimes [x 18]
    (future
      (swap! a long-operation-on-score some-param))))

In this case long-operation-on-score will need to be re-run every time a
thread modifies the atom. However if our function only needs the state of
the ref to add to it, then we can do something like this instead:

(let [a (atom 0)]
  (dotimes [x 18]
    (future
      (let [score (long-operation-on-score some-param)]
        (swap! a + score)))))

Now we only have a simple addition inside the swap! and we will have less
contention between the CPUs because they will most likely be spending more
time inside 'long-operation-on-score' instead of inside the swap.

*TL;DR*: do as little work as possible inside swap!; the more you have
inside swap!, the higher chance you will have of throwing away work due to
swap! retries.
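
(compare-and-swap above is pseudocode; the real thing on the JVM is
AtomicReference.compareAndSet, which clojure.lang.Atom uses under the
hood. The same retry loop, written directly against that API:)

(import 'java.util.concurrent.atomic.AtomicReference)

(let [x* (AtomicReference. {})
      i  42]                      ; stand-in for the value being stored
  (loop []
    (let [old-val (.get x*)
          new-val (assoc old-val :k i)]
      (if (.compareAndSet x* old-val new-val)
        new-val
        (recur)))))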

Timothy

On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta  wrote:

> by the way, have you tried both Oracle and Open JDK with the same results?
> Gianluca
>
> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>>
>> David, you say "Based on jvisualvm monitoring, doesn't seem to be
>> GC-related".
>>
>> What is jvisualvm showing you related to GC and/or memory allocation when
>> you tried the 18-core version with 18 threads in the same process?
>>
>> Even memory allocation could become a point of contention, depending upon
>> how the memory allocation works with many threads.  e.g. Depends on whether
>> a thread gets a large chunk of memory on a global lock, and then locally
>> carves it up into the small pieces it needs for each individual Java 'new'
>> allocation, or gets a global lock for every 'new'.  The latter would give
>> terrible performance as # cores increase, but I don't know how to tell
>> whether that is the case, except by knowing more about how the memory
>> allocator is implemented in your JVM.  Maybe digging through OpenJDK source
>> code in the right place would tell?
>>
>> Andy
>>
>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:
>>
>>> correction: that "do" should be a "doall".  (My actual test code was a
>>> bit different, but each run printed some info when it started so it doesn't
>>> have to do with delayed evaluation of lazy seq's or anything).
>>>
>>>
>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:

 Andy:  Interesting.  Thanks for educating me on the fact that atom
 swap's don't use the STM.  Your theory seems plausible... I will try those
 tests next time I launch the 18-core instance, but yeah, not sure how
 illuminating the results will be.

 Niels: along the lines of this (so that each thread prints its time as
 well as printing the overall time):

   (time
    (let [f f1
          n-runs 18
          futs (do (for [i (range n-runs)]
                     (future (time (f)))))]
      (doseq [fut futs]
        @fut)))


 On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren
 wrote:
>
> Could you also show how you are running these functions in parallel
> and time them? The way you start the functions can have as much impact as
> the functions themselves.
>
> Regards,
> Niels
>
> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>
>> I have functions f1 and f2 below, and let's say they run in T1 and T2
>> amount of time when running a single instance/thread.  The issue I'm 
>> facing
>> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and
>> for more complex funcs takes absurdly long.
>>
>>
>>    (defn f1 []
>>      (apply + (range 2e9)))
>>
>>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>    ;; should never retry.
>>    (defn f2 []
>>      (let [x* (atom {})]
>>        (loop [i 1e9]
>>          (when-not (zero? i)
>>            (swap! x* assoc :k i)
>>            (recur (dec i))))))

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread Timothy Baldridge
Oh, then I completely misunderstood the problem at hand here. If that's
the case then do the following:

Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes
anything.
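
(Applied to the f2 from the original post, that change looks like the
sketch below - f2-volatile is a made-up name. vswap! does a plain read
plus volatile write with no CAS retry loop, so it isolates the cost of
the CAS itself:)

(defn f2-volatile []
  (let [x* (volatile! {})]
    (loop [i 1e9]
      (when-not (zero? i)
        (vswap! x* assoc :k i)
        (recur (dec i))))))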

Timothy


On Wed, Nov 18, 2015 at 9:00 AM, David Iba  wrote:

> Timothy:  Each thread (call of f2) creates its own "local" atom, so I
> don't think there should be any swap retries.
>
> Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into
> trying Oracle and report back.
>
> Andy:  jvisualvm was showing pretty much all of the memory allocated in
> the eden space and a little in the first survivor (no major/full GC's), and
> total GC Time was very minimal.
>
> I'm in the middle of running some more tests and will report back when I
> get a chance today or tomorrow.  Thanks for all the feedback on this!
>
> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>
>> This sort of code is somewhat the worst case situation for atoms (or
>> really for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS
>> operation that most x86 CPUs have as an instruction. If we expand swap! it
>> looks something like this:
>>
>> (loop [old-val @x*]
>>   (let [new-val (assoc old-val :k i)]
>>     (if (compare-and-swap x* old-val new-val)
>>       new-val
>>       (recur @x*))))
>>
>> Compare-and-swap can be defined as "updates the content of the reference
>> to new-val only if the current value of the reference is equal to the
>> old-val".
>>
>> So in essence, only one core can be modifying the contents of an atom at
>> a time; if the atom is modified during the execution of the swap! call,
>> then swap! will continue to re-run your function until it's able to update
>> the atom without it being modified during the function's execution.
>>
>> So let's say you have some super long task that you need to integrate
>> into a ref; here's one way to do it, but probably not the best:
>>
>> (let [a (atom 0)]
>>   (dotimes [x 18]
>>     (future
>>       (swap! a long-operation-on-score some-param))))
>>
>> In this case long-operation-on-score will need to be re-run every time a
>> thread modifies the atom. However if our function only needs the state of
>> the ref to add to it, then we can do something like this instead:
>>
>> (let [a (atom 0)]
>>   (dotimes [x 18]
>>     (future
>>       (let [score (long-operation-on-score some-param)]
>>         (swap! a + score)))))
>>
>> Now we only have a simple addition inside the swap! and we will have less
>> contention between the CPUs because they will most likely be spending more
>> time inside 'long-operation-on-score' instead of inside the swap.
>>
>> *TL;DR*: do as little work as possible inside swap!; the more you have
>> inside swap!, the higher chance you will have of throwing away work due to
>> swap! retries.
>>
>> Timothy
>>
>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta 
>> wrote:
>>
>>> by the way, have you tried both Oracle and Open JDK with the same
>>> results?
>>> Gianluca
>>>
>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:

 David, you say "Based on jvisualvm monitoring, doesn't seem to be
 GC-related".

 What is jvisualvm showing you related to GC and/or memory allocation
 when you tried the 18-core version with 18 threads in the same process?

 Even memory allocation could become a point of contention, depending
 upon how the memory allocation works with many threads.  e.g. Depends on
 whether a thread gets a large chunk of memory on a global lock, and then
 locally carves it up into the small pieces it needs for each individual
 Java 'new' allocation, or gets a global lock for every 'new'.  The latter
 would give terrible performance as # cores increase, but I don't know how
 to tell whether that is the case, except by knowing more about how the
 memory allocator is implemented in your JVM.  Maybe digging through OpenJDK
 source code in the right place would tell?

 Andy

 On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:

> correction: that "do" should be a "doall".  (My actual test code was a
> bit different, but each run printed some info when it started so it 
> doesn't
> have to do with delayed evaluation of lazy seq's or anything).
>
>
> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>
>> Andy:  Interesting.  Thanks for educating me on the fact that atom
>> swap's don't use the STM.  Your theory seems plausible... I will try 
>> those
>> tests next time I launch the 18-core instance, but yeah, not sure how
>> illuminating the results will be.
>>
>> Niels: along the lines of this (so that each thread prints its time
>> as well as printing the overall time):
>>
>>    (time
>>     (let [f f1
>>           n-runs 18
>>           futs (do (for [i (range n-runs)]
>>                      (future (time (f)))))]
>>       (doseq [fut futs]
>>         @fut)))

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-18 Thread David Iba
OK, have a few updates to report:

   - Oracle vs OpenJDK did not make a difference
   - Whenever I run N>1 threads calling any of these functions with
     swap/vswap, there is some overhead compared to running 18 separate
     single-run processes in parallel.  This overhead seems to increase as
     N increases.
     - For both swap and vswap, the function timings from running 18
       futures (from one JVM) show about 1.5X the time from running 18
       separate JVM processes.
     - For the swap version (f2), very often a few of the calls would go
       rogue and take around 3X the time of the others.
       - This did not happen for the vswap version of f2.
     - Running 9 processes with 2 f2-calling threads each was maybe 4%
       slower than 18 processes of 1.
   - Running 4 processes with 4 f2-calling threads each was mostly the same
     speed as the 18x1, but there were a couple of those rogue threads that
     took 2-3X the time of the others.

Any ideas?

On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>
> No worries.  Thanks, I'll give that a try as well!
>
> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>
>> Oh, then I completely misunderstood the problem at hand here. If that's 
>> the case then do the following:
>>
>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes 
>> anything. 
>>
>> Timothy
>>
>>
>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba  wrote:
>>
>>> Timothy:  Each thread (call of f2) creates its own "local" atom, so I 
>>> don't think there should be any swap retries.
>>>
>>> Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into 
>>> trying Oracle and report back.
>>>
>>> Andy:  jvisualvm was showing pretty much all of the memory allocated in 
>>> the eden space and a little in the first survivor (no major/full GC's), and 
>>> total GC Time was very minimal.
>>>
>>> I'm in the middle of running some more tests and will report back when I 
>>> get a chance today or tomorrow.  Thanks for all the feedback on this!
>>>
>>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:

 This sort of code is somewhat the worst case situation for atoms (or 
 really for CAS). Clojure's swap! is based off the "compare-and-swap" or 
 CAS operation that most x86 CPUs have as an instruction. If we expand 
 swap! it looks something like this:

 (loop [old-val @x*]
   (let [new-val (assoc old-val :k i)]
     (if (compare-and-swap x* old-val new-val)
       new-val
       (recur @x*))))

 Compare-and-swap can be defined as "updates the content of the 
 reference to new-val only if the current value of the reference is equal 
 to the old-val".

 So in essence, only one core can be modifying the contents of an atom 
 at a time; if the atom is modified during the execution of the swap! call, 
 then swap! will continue to re-run your function until it's able to update 
 the atom without it being modified during the function's execution.

 So let's say you have some super long task that you need to integrate 
 into a ref; here's one way to do it, but probably not the best:

 (let [a (atom 0)]
   (dotimes [x 18]
     (future
       (swap! a long-operation-on-score some-param))))

 In this case long-operation-on-score will need to be re-run every time 
 a thread modifies the atom. However if our function only needs the state 
 of the ref to add to it, then we can do something like this instead:

 (let [a (atom 0)]
   (dotimes [x 18]
     (future
       (let [score (long-operation-on-score some-param)]
         (swap! a + score)))))

 Now we only have a simple addition inside the swap! and we will have 
 less contention between the CPUs because they will most likely be spending 
 more time inside 'long-operation-on-score' instead of inside the swap.

 *TL;DR*: do as little work as possible inside swap!; the more you have 
 inside swap!, the higher chance you will have of throwing away work due to 
 swap! retries.

 Timothy

 On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta  wrote:

> by the way, have you tried both Oracle and Open JDK with the same 
> results?
> Gianluca
>
> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut 
> wrote:
>>
>> David, you say "Based on jvisualvm monitoring, doesn't seem to be 
>> GC-related".
>>
>> What is jvisualvm showing you related to GC and/or memory allocation 
>> when you tried the 18-core version with 18 threads in the same process?
>>
>> Even memory allocation could become a point of contention, depending 
>> upon how the memory allocation works with many threads.  e.g. Depends on 
>> whether a thread gets a large chunk of memory 

Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread Niels van Klaveren
Could you also show how you are running these functions in parallel and 
time them? The way you start the functions can have as much impact as the 
functions themselves.

Regards,
Niels

On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>
> I have functions f1 and f2 below, and let's say they run in T1 and T2 
> amount of time when running a single instance/thread.  The issue I'm facing 
> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and 
> for more complex funcs takes absurdly long.
>
>
>    (defn f1 []
>      (apply + (range 2e9)))
>
>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>    ;; should never retry.
>    (defn f2 []
>      (let [x* (atom {})]
>        (loop [i 1e9]
>          (when-not (zero? i)
>            (swap! x* assoc :k i)
>            (recur (dec i))))))
>
>
> Of note:
> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 
> for 4 runs in parallel)
> - running 18 f1's in parallel on the 18-core machine also parallelizes 
> well.
> - Disabling hyperthreading doesn't help.
> - Based on jvisualvm monitoring, doesn't seem to be GC-related
> - also tried on dedicated 18-core ec2 instance with same issues, so not 
> shared-tenancy-related
> - if I make a jar that runs a single f2 and launch 18 in parallel, it 
> parallelizes well (so I don't think it's machine/aws-related)
>
> Could it be that the 18 f2's in parallel on a single JVM instance is 
> overworking the STM with all the swap's?  Any other theories?
>
> Thanks!
>



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread David Iba
Andy:  Interesting.  Thanks for educating me on the fact that atom swap's 
don't use the STM.  Your theory seems plausible... I will try those tests 
next time I launch the 18-core instance, but yeah, not sure how 
illuminating the results will be.

Niels: along the lines of this (so that each thread prints its time as well 
as printing the overall time):

   (time
    (let [f f1
          n-runs 18
          futs (do (for [i (range n-runs)]
                     (future (time (f)))))]
      (doseq [fut futs]
        @fut)))

On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren wrote:
>
> Could you also show how you are running these functions in parallel and 
> time them? The way you start the functions can have as much impact as the 
> functions themselves.
>
> Regards,
> Niels
>
> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>
>> I have functions f1 and f2 below, and let's say they run in T1 and T2 
>> amount of time when running a single instance/thread.  The issue I'm facing 
>> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and 
>> for more complex funcs takes absurdly long.
>>
>>
>>    (defn f1 []
>>      (apply + (range 2e9)))
>>
>>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>    ;; should never retry.
>>    (defn f2 []
>>      (let [x* (atom {})]
>>        (loop [i 1e9]
>>          (when-not (zero? i)
>>            (swap! x* assoc :k i)
>>            (recur (dec i))))))
>>
>>
>> Of note:
>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 
>> for 4 runs in parallel)
>> - running 18 f1's in parallel on the 18-core machine also parallelizes 
>> well.
>> - Disabling hyperthreading doesn't help.
>> - Based on jvisualvm monitoring, doesn't seem to be GC-related
>> - also tried on dedicated 18-core ec2 instance with same issues, so not 
>> shared-tenancy-related
>> - if I make a jar that runs a single f2 and launch 18 in parallel, it 
>> parallelizes well (so I don't think it's machine/aws-related)
>>
>> Could it be that the 18 f2's in parallel on a single JVM instance is 
>> overworking the STM with all the swap's?  Any other theories?
>>
>> Thanks!
>>
>



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread David Iba
correction: that "do" should be a "doall".  (My actual test code was a bit 
different, but each run printed some info when it started so it doesn't 
have to do with delayed evaluation of lazy seq's or anything).
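
(A sketch of what the doall changes, with f standing in for f1/f2 - and
why the printed start times were still a reliable check:)

;; eager: all 18 futures are created before the doseq starts
(let [futs (doall (for [i (range 18)]
                    (future (f))))]
  (doseq [fut futs] @fut))

;; lazy (the uncorrected version): each future is created only as the
;; doseq realizes that element -- though since (range 18) is chunked,
;; the whole chunk (all 18) is realized at the first step anyway, which
;; is why the start-time printouts still looked simultaneous
(let [futs (for [i (range 18)]
             (future (f)))]
  (doseq [fut futs] @fut))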

On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>
> Andy:  Interesting.  Thanks for educating me on the fact that atom swap's 
> don't use the STM.  Your theory seems plausible... I will try those tests 
> next time I launch the 18-core instance, but yeah, not sure how 
> illuminating the results will be.
>
> Niels: along the lines of this (so that each thread prints its time as 
> well as printing the overall time):
>
>    (time
>     (let [f f1
>           n-runs 18
>           futs (do (for [i (range n-runs)]
>                      (future (time (f)))))]
>       (doseq [fut futs]
>         @fut)))
>
>
> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren 
> wrote:
>>
>> Could you also show how you are running these functions in parallel and 
>> time them? The way you start the functions can have as much impact as the 
>> functions themselves.
>>
>> Regards,
>> Niels
>>
>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>
>>> I have functions f1 and f2 below, and let's say they run in T1 and T2 
>>> amount of time when running a single instance/thread.  The issue I'm facing 
>>> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and 
>>> for more complex funcs takes absurdly long.
>>>
>>>
>>>    (defn f1 []
>>>      (apply + (range 2e9)))
>>>
>>>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>>    ;; should never retry.
>>>    (defn f2 []
>>>      (let [x* (atom {})]
>>>        (loop [i 1e9]
>>>          (when-not (zero? i)
>>>            (swap! x* assoc :k i)
>>>            (recur (dec i))))))
>>>
>>>
>>> Of note:
>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and 
>>> T2 for 4 runs in parallel)
>>> - running 18 f1's in parallel on the 18-core machine also parallelizes 
>>> well.
>>> - Disabling hyperthreading doesn't help.
>>> - Based on jvisualvm monitoring, doesn't seem to be GC-related
>>> - also tried on dedicated 18-core ec2 instance with same issues, so not 
>>> shared-tenancy-related
>>> - if I make a jar that runs a single f2 and launch 18 in parallel, it 
>>> parallelizes well (so I don't think it's machine/aws-related)
>>>
>>> Could it be that the 18 f2's in parallel on a single JVM instance is 
>>> overworking the STM with all the swap's?  Any other theories?
>>>
>>> Thanks!
>>>
>>



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-17 Thread Andy Fingerhut
David, you say "Based on jvisualvm monitoring, doesn't seem to be
GC-related".

What is jvisualvm showing you related to GC and/or memory allocation when
you tried the 18-core version with 18 threads in the same process?

Even memory allocation could become a point of contention, depending upon
how the memory allocation works with many threads.  E.g. it depends on
whether a thread gets a large chunk of memory on a global lock, and then
locally carves it up into the small pieces it needs for each individual
Java 'new' allocation, or gets a global lock for every 'new'.  The latter
would give terrible performance as # cores increase, but I don't know how
to tell whether that is the case, except by knowing more about how the
memory allocator is implemented in your JVM.  Maybe digging through OpenJDK
source code in the right place would tell?
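
(One way to peek at this without reading allocator source - assuming a
HotSpot JVM of this era, where these flags exist: -XX:+PrintTLAB prints
per-thread TLAB statistics at each GC, including the 'slow' allocations
that bypassed the thread-local buffer:)

    java -XX:+PrintTLAB -verbose:gc -jar bench.jar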

Andy

On Tue, Nov 17, 2015 at 2:00 AM, David Iba  wrote:

> correction: that "do" should be a "doall".  (My actual test code was a bit
> different, but each run printed some info when it started so it doesn't
> have to do with delayed evaluation of lazy seq's or anything).
>
>
> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>
>> Andy:  Interesting.  Thanks for educating me on the fact that atom swap's
>> don't use the STM.  Your theory seems plausible... I will try those tests
>> next time I launch the 18-core instance, but yeah, not sure how
>> illuminating the results will be.
>>
>> Niels: along the lines of this (so that each thread prints its time as
>> well as printing the overall time):
>>
>>    (time
>>     (let [f f1
>>           n-runs 18
>>           futs (do (for [i (range n-runs)]
>>                      (future (time (f)))))]
>>       (doseq [fut futs]
>>         @fut)))
>>
>>
>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren
>> wrote:
>>>
>>> Could you also show how you are running these functions in parallel and
>>> time them? The way you start the functions can have as much impact as the
>>> functions themselves.
>>>
>>> Regards,
>>> Niels
>>>
>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:

 I have functions f1 and f2 below, and let's say they run in T1 and T2
 amount of time when running a single instance/thread.  The issue I'm facing
 is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and
 for more complex funcs takes absurdly long.


    (defn f1 []
      (apply + (range 2e9)))

    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
    ;; should never retry.
    (defn f2 []
      (let [x* (atom {})]
        (loop [i 1e9]
          (when-not (zero? i)
            (swap! x* assoc :k i)
            (recur (dec i))))))


 Of note:
 - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and
 T2 for 4 runs in parallel)
 - running 18 f1's in parallel on the 18-core machine also parallelizes
 well.
 - Disabling hyperthreading doesn't help.
 - Based on jvisualvm monitoring, doesn't seem to be GC-related
 - also tried on dedicated 18-core ec2 instance with same issues, so not
 shared-tenancy-related
 - if I make a jar that runs a single f2 and launch 18 in parallel, it
 parallelizes well (so I don't think it's machine/aws-related)

 Could it be that the 18 f2's in parallel on a single JVM instance is
 overworking the STM with all the swap's?  Any other theories?

 Thanks!



Re: Poor parallelization performance across 18 cores (but not 4)

2015-11-16 Thread Andy Fingerhut
There is no STM involved if you only have atoms, and no refs, so it can't
be STM-related.

I have a conjecture, but don't yet have a suggestion for an experiment that
would prove or disprove it.

The JVM memory model requires that state changes to values that should be
visible to all threads, like swap! on an atom, actually be made visible to
all threads, which I think may often be implemented by flushing any locally
cached values to main memory, even if no other thread actually reads the
value.

Your f1 code only does thread-local computation with no requirement to make
its results visible to other threads.

Your f2 code must make its results visible to other threads.  Not only
that, but it allocates new memory for each new value it must make visible
(via the calls to assoc).

Perhaps main memory is not fast enough to keep up with 18 cores running f2
at full rate, but it is fast enough to keep up with 4 cores running f2 at
full rate?

Maybe collecting data for time to completion for all numbers of cores
running f2, from 4 up to 18, on the same hardware would be illuminating?
Especially if it showed that there was some maximum total number of 'f2
iterations per second' that was equal across any number of cores running
f2 in parallel?

I am not sure whether that would explain your results of running 18
separate processes each running 1 thread of f2 in parallel getting full
speedup, unless the JVM can tell only one thread is running and thus no
flushes to main memory are required.  Maybe try running 9 processes, each
with 2 f2 threads, to see if it is as bad as 1 process with 18 threads?
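
(A sketch of that sweep - sweep is a made-up helper; assumes the f2 from
the original post:)

;; time n concurrent runs of f, for n from 4 to 18
(defn sweep [f]
  (doseq [n (range 4 19)]
    (let [start (System/nanoTime)
          futs  (doall (repeatedly n #(future (f))))]
      (doseq [fut futs] @fut)
      (println n "threads:"
               (/ (- (System/nanoTime) start) 1e9) "sec"))))

(sweep f2)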

Andy


On Mon, Nov 16, 2015 at 9:01 PM, David Iba  wrote:

> I have functions f1 and f2 below, and let's say they run in T1 and T2
> amount of time when running a single instance/thread.  The issue I'm facing
> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and
> for more complex funcs takes absurdly long.
>
>
>    (defn f1 []
>      (apply + (range 2e9)))
>
>    ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>    ;; should never retry.
>    (defn f2 []
>      (let [x* (atom {})]
>        (loop [i 1e9]
>          (when-not (zero? i)
>            (swap! x* assoc :k i)
>            (recur (dec i))))))
>
>
> Of note:
> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2
> for 4 runs in parallel)
> - running 18 f1's in parallel on the 18-core machine also parallelizes
> well.
> - Disabling hyperthreading doesn't help.
> - Based on jvisualvm monitoring, doesn't seem to be GC-related
> - also tried on dedicated 18-core ec2 instance with same issues, so not
> shared-tenancy-related
> - if I make a jar that runs a single f2 and launch 18 in parallel, it
> parallelizes well (so I don't think it's machine/aws-related)
>
> Could it be that the 18 f2's in parallel on a single JVM instance is
> overworking the STM with all the swap's?  Any other theories?
>
> Thanks!
>
