Re: Poor parallelization performance across 18 cores (but not 4)
Andy: Heh, glad to hear I'm not the only one facing this issue, and I appreciate the encouragement, since it's been kicking my ass for the past week :) On the bright side, as someone coming from more of a math background, this has forced me to learn a lot about how CPUs/threads/memory/etc. work!

Herwig: I just got a chance to look through the thread you linked - it sounds very, very similar to what I'm encountering!

Niels: Glad to hear you're able to replicate the behavior. I was also using claypoole's unordered pmap myself, but excluded it from my code examples for simplicity :) One thing that's tricky about benchmarking with hyperthreading enabled: for fully CPU-bound jobs that don't share any cache, if you're using all virtual cores (8 in your case), a 2X slowdown would be expected. Furthermore, if you launch fewer threads than the number of vCPUs available, it's possible that two threads get assigned to the same vCPU and thus again run in 2X the time. I noticed this seemed to happen more when the threads were spawned from the same Java process (probably because it's presumed they can share cache) than when spawned from separate processes. So IMO the best way to test in this setting (without disabling HT) is to max out the vCPUs and compare against the expected 2X slowdown.

I think the "multiple threads allocating simultaneously" hypothesis makes the most sense so far. The TLAB setting is interesting and I'll definitely try adjusting it - is setting the JVM option "-XX:MinTLABSize" (like in the StackOverflow link Andy posted; note it takes a value, e.g. "-XX:MinTLABSize=2m", rather than the boolean "-XX:+" form) the best way to go about this?

On Friday, November 20, 2015 at 5:53:42 PM UTC+9, Niels van Klaveren wrote:
>
> For what it's worth, here's the code I've been using while experimenting along with this at home.
>
> Basically, it's a for loop over a collection of functions and a collection of core counts, running a fixed number of tasks.
> So for every function, it can step up from running f on one core n times to running f on x cores one time each. I use com.climate/claypoole's unordered pmap, which gives a nice abstraction over spawning futures.
>
> Included are two function sets: summation and key assoc (since the cross-comparison used in the OP bugged me a bit).
> Suggestions for alterations are welcome, but the tests I ran seem to show that all variants of the functions slow down considerably the more they are run in parallel (2-3x overhead compared to a single-core run).
>
> Granted, I could only test this on a 4-core (8 with hyperthreading) machine.
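For reference, a sketch of how the TLAB-related HotSpot options mentioned above could be inspected and set (flag availability and defaults vary by JVM version, and the jar path is a placeholder, not from the thread):

```shell
# List HotSpot's TLAB-related flags and their current values
java -XX:+PrintFlagsFinal -version | grep -i tlab

# Example: raise the minimum TLAB size. MinTLABSize takes a value,
# unlike boolean -XX:+/- flags. "app.jar" is a placeholder path.
java -XX:MinTLABSize=1m -jar app.jar
```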
Re: Poor parallelization performance across 18 cores (but not 4)
This reminds me of another thread where performance issues related to concurrent allocation were explored in depth: https://groups.google.com/d/topic/clojure/48W2eff3caU/discussion

The main takeaway for me was that HotSpot will slow down pretty dramatically as soon as there are two threads allocating. Could you try:
a) seeing how performance develops when you take out the allocation (the assoc), and
b) seeing whether increasing HotSpot's TLAB size makes any difference?

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
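A minimal sketch of suggestion (a), assuming the f2 from the original post: the same hot loop, but accumulating a primitive long instead of assoc'ing into a map, so the loop body allocates nothing (`f2-no-alloc` and the `n` parameter are my names, not from the thread):

```clojure
;; f2's loop shape with the allocation removed: no atom, no map,
;; just a primitive accumulator, so the hot loop allocates nothing.
(defn f2-no-alloc [n]
  (loop [i (long n) acc 0]
    (if (zero? i)
      acc
      (recur (dec i) (unchecked-add acc i)))))
```

If this version scales cleanly across 18 cores while f2 does not, that points at allocation rather than the loop itself.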
Re: Poor parallelization performance across 18 cores (but not 4)
David:

No new suggestions to add right now. Herwig's suggestion that it could be the Java allocator has some evidence for it, given your results. I'm not sure whether this StackOverflow Q on TLAB is fully accurate, but it may provide some useful info:

http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab

I mainly wanted to give you a virtual high-five, kudos, and thank-you thank-you thank-you thank-you thank-you for taking the time to run these experiments. Similar performance issues with many threads in the same JVM on a many-core machine have come up before in the past, and so far I don't know if anyone has gotten to the bottom of it yet.

Andy

On Wed, Nov 18, 2015 at 10:36 PM, David Iba wrote:
> OK, have a few updates to report:
>
> - Oracle vs OpenJDK did not make a difference.
> - Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N increases.
> - For both swap and vswap, the function timings from running 18 futures (from one JVM) show about 1.5X the time from running 18 separate JVM processes.
>   - For the swap version (f2), very often a few of the calls would go rogue and take around 3X the time of the others.
>   - This did not happen for the vswap version of f2.
> - Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1.
> - Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1, but there were a couple of those rogue threads that took 2-3X the time of the others.
>
> Any ideas?
Re: Poor parallelization performance across 18 cores (but not 4)
On Thursday, November 19, 2015 at 1:36:59 AM UTC-5, David Iba wrote:
>
> OK, have a few updates to report:
>
> - Oracle vs OpenJDK did not make a difference.
> - Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N increases.
> - For both swap and vswap, the function timings from running 18 futures (from one JVM) show about 1.5X the time from running 18 separate JVM processes.
>   - For the swap version (f2), very often a few of the calls would go rogue and take around 3X the time of the others.
>   - This did not happen for the vswap version of f2.
> - Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1.
> - Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1, but there were a couple of those rogue threads that took 2-3X the time of the others.
>
> Any ideas?

Try a one-element array and aset, and see if that's faster than atom/swap and volatile/vswap. The latter two have memory barriers and the former does not, so if flushing the CPU cache is the key here, aset should be faster; but if it's something else, it will probably be the same speed.
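The suggested aset variant of f2 might look like this (a sketch; `f2-aset` and the `n` parameter are my names): a one-element long-array stands in for the atom/volatile, so the per-iteration write carries no memory barrier.

```clojure
;; f2 with a one-element primitive array instead of an atom.
;; aset on a plain array is an ordinary store: no CAS, no volatile
;; semantics, no memory barrier.
(defn f2-aset [n]
  (let [a (long-array 1)]
    (loop [i (long n)]
      (when-not (zero? i)
        (aset a 0 i)
        (recur (dec i))))
    (aget a 0)))
```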
Re: Poor parallelization performance across 18 cores (but not 4)
Yeah, I actually tried using aset as well, and was still seeing these "rogue" threads taking much longer (although the ones that did finish in a normal amount of time had completion times very similar to those running in their own process).

Herwig: I will try those suggestions when I get a chance.

On Thu, Nov 19, 2015 at 6:19 PM, Fluid Dynamics wrote:
>
> Try a one-element array and aset, and see if that's faster than atom/swap and volatile/vswap. The latter two have memory barriers and the former does not, so if flushing the CPU cache is the key here, aset should be faster; but if it's something else, it will probably be the same speed.
Re: Poor parallelization performance across 18 cores (but not 4)
Timothy: Each thread (call of f2) creates its own "local" atom, so I don't think there should be any swap retries.

Gianluca: Good idea! I've only tried OpenJDK, but I will look into trying Oracle and report back.

Andy: jvisualvm was showing pretty much all of the memory allocated in the eden space and a little in the first survivor (no major/full GCs), and total GC time was very minimal.

I'm in the middle of running some more tests and will report back when I get a chance today or tomorrow. Thanks for all the feedback on this!

On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>
> This sort of code is somewhat the worst-case situation for atoms (or really for CAS). Clojure's swap! is based on the "compare-and-swap" or CAS operation that most x86 CPUs have as an instruction. If we expand swap!, it looks something like this:
>
> (loop [old-val @x*]
>   (let [new-val (assoc old-val :k i)]
>     (if (compare-and-swap x* old-val new-val)
>       new-val
>       (recur @x*))))
>
> Compare-and-swap can be defined as "updates the content of the reference to new-val only if the current value of the reference is equal to old-val".
>
> So in essence, only one core can be modifying the contents of an atom at a time. If the atom is modified during the execution of the swap! call, then swap! will continue to re-run your function until it's able to update the atom without it being modified during the function's execution.
>
> So let's say you have some super-long task whose result you need to integrate into a ref. Here's one way to do it, but probably not the best:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (swap! a long-operation-on-score some-param))))
>
> In this case long-operation-on-score will need to be re-run every time a thread modifies the atom. However, if our function only needs the state of the ref to add to it, then we can do something like this instead:
>
> (let [a (atom 0)]
>   (dotimes [x 18]
>     (future
>       (let [score (long-operation-on-score some-param)]
>         (swap! a + score)))))
>
> Now we only have a simple addition inside the swap!, and we will have less contention between the CPUs because they will most likely be spending more time inside 'long-operation-on-score' than inside the swap.
>
> *TL;DR*: do as little work as possible inside swap!; the more you have inside swap!, the higher the chance of throwing away work due to swap! retries.
>
> Timothy
Re: Poor parallelization performance across 18 cores (but not 4)
No worries. Thanks, I'll give that a try as well!

On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>
> Oh, then I completely misunderstood the problem at hand here. If that's the case, then do the following:
>
> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes anything.
>
> Timothy
Re: Poor parallelization performance across 18 cores (but not 4)
by the way, have you tried both Oracle and OpenJDK with the same results?

Gianluca

On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>
> David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related".
>
> What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process?
>
> Even memory allocation could become a point of contention, depending upon how the memory allocation works with many threads. E.g., it depends on whether a thread takes a global lock to get a large chunk of memory, which it then locally carves up into the small pieces it needs for each individual Java 'new' allocation, or takes a global lock for every 'new'. The latter would give terrible performance as # cores increases, but I don't know how to tell whether that is the case, except by knowing more about how the memory allocator is implemented in your JVM. Maybe digging through OpenJDK source code in the right place would tell?
>
> Andy
>
> On Tue, Nov 17, 2015 at 2:00 AM, David Iba wrote:
>
>> Correction: that "do" should be a "doall". (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.)
>>
>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>
>>> Andy: Interesting. Thanks for educating me on the fact that atom swaps don't use the STM. Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, not sure how illuminating the results will be.
>>>
>>> Niels: along the lines of this (so that each thread prints its time as well as printing the overall time):
>>>
>>> (time
>>>   (let [f f1
>>>         n-runs 18
>>>         futs (do (for [i (range n-runs)]
>>>                    (future (time (f)))))]
>>>     (doseq [fut futs]
>>>       @fut)))
>>>
>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren wrote:
>>>>
>>>> Could you also show how you are running these functions in parallel and how you time them? The way you start the functions can have as much impact as the functions themselves.
>>>>
>>>> Regards,
>>>> Niels
>>>>
>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>>>
>>>>> I have functions f1 and f2 below, and let's say they run in T1 and T2 amount of time when running in a single instance/thread. The issue I'm facing is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and for more complex funcs takes absurdly long.
>>>>>
>>>>> (defn f1 []
>>>>>   (apply + (range 2e9)))
>>>>>
>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!' should never retry.
>>>>> (defn f2 []
>>>>>   (let [x* (atom {})]
>>>>>     (loop [i 1e9]
>>>>>       (when-not (zero? i)
>>>>>         (swap! x* assoc :k i)
>>>>>         (recur (dec i))))))
>>>>>
>>>>> Of note:
>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 for 4 runs in parallel).
>>>>> - Running 18 f1's in parallel on the 18-core machine also parallelizes well.
>>>>> - Disabling hyperthreading doesn't help.
>>>>> - Based on jvisualvm monitoring, it doesn't seem to be GC-related.
>>>>> - I also tried a dedicated 18-core EC2 instance with the same issues, so it's not shared-tenancy-related.
>>>>> - If I make a jar that runs a single f2 and launch 18 in parallel, it parallelizes well (so I don't think it's machine/AWS-related).
>>>>>
>>>>> Could it be that the 18 f2's in parallel on a single JVM instance are overworking the STM with all the swaps? Any other theories?
>>>>>
>>>>> Thanks!
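The timing code quoted above, with the "doall" correction applied, can be collected into a self-contained harness (a sketch; `bench` is a hypothetical name, and plain futures are used as in the original):

```clojure
;; Run f in n futures, printing each future's own wall time and the
;; overall time. doall forces the lazy for-seq so the futures actually
;; start immediately, per the "do should be doall" correction above.
(defn bench [f n]
  (time
    (let [futs (doall (for [_ (range n)]
                        (future (time (f)))))]
      (doseq [fut futs]
        @fut))))
```

For example, `(bench #(reduce + (range 1e6)) 18)` would reproduce the 18-futures-in-one-JVM experiment.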
Re: Poor parallelization performance across 18 cores (but not 4)
This sort of code is somewhat the worst case situation for atoms (or really for CAS). Clojure's swap! is based off the "compare-and-swap" or CAS operation that most x86 CPUs have as an instruction. If we expand swap! it looks something like this: (loop [old-val @x*] (let [new-val (assoc old-val :k i)] (if (compare-and-swap x* old-val new-val) new-val (recur @x*))) Compare-and-swap can be defined as "updates the content of the reference to new-val only if the current value of the reference is equal to the old-val). So in essence, only one core can be modifying the contents of an atom at a time, if the atom is modified during the execution of the swap! call, then swap! will continue to re-run your function until it's able to update the atom without it being modified during the function's execution. So let's say you have some super long task that you need to integrate into a ref, he's one way to do it, but probably not the best: (let [a (atom 0)] (dotimes [x 18] (future (swap! a long-operation-on-score some-param In this case long-operation-on-score will need to be re-run every time a thread modifies the atom. However if our function only needs the state of the ref to add to it, then we can do something like this instead: (let [a (atom 0)] (dotimes [x 18] (future (let [score (long-operation-on-score some-param) (swap! a + score) Now we only have a simple addition inside the swap! and we will have less contention between the CPUs because they will most likely be spending more time inside 'long-operation-on-score' instead of inside the swap. *TL;DR*: do as little work as possible inside swap! the more you have inside swap! the higher chance you will have of throwing away work due to swap! retries. Timothy On Wed, Nov 18, 2015 at 8:13 AM, gianluca tortawrote: > by the way, have you tried both Oracle and Open JDK with the same results? 
> Gianluca
>
> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>> [earlier messages in the thread quoted in full; snipped]
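Timothy's expansion of swap! maps almost directly onto java.util.concurrent.atomic.AtomicReference, which is the machinery behind a Clojure atom. A minimal Java sketch of the same retry loop (names here are illustrative, not Clojure's actual implementation):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class SwapDemo {
    // Equivalent of Clojure's swap!: retry compareAndSet until no other
    // thread has modified the reference between our read and our write.
    static <T> T swap(AtomicReference<T> ref, UnaryOperator<T> f) {
        while (true) {
            T oldVal = ref.get();
            T newVal = f.apply(oldVal);
            if (ref.compareAndSet(oldVal, newVal)) {
                return newVal;
            }
            // CAS failed: another thread won the race; recompute and retry.
        }
    }

    public static void main(String[] args) {
        AtomicReference<Long> a = new AtomicReference<>(0L);
        System.out.println(swap(a, x -> x + 5)); // 5
    }
}
```

Under contention the compareAndSet fails and f is re-applied, which is exactly the wasted work the TL;DR above warns about.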
Re: Poor parallelization performance across 18 cores (but not 4)
Oh, then I completely misunderstood the problem at hand here. If that's the case, then do the following: change "atom" to "volatile!" and "swap!" to "vswap!", and see if that changes anything.

Timothy

On Wed, Nov 18, 2015 at 9:00 AM, David Iba wrote:
> Timothy: Each thread (call of f2) creates its own "local" atom, so I
> don't think there should be any swap retries.
>
> Gianluca: Good idea! I've only tried OpenJDK, but I will look into
> trying Oracle and report back.
>
> Andy: jvisualvm was showing pretty much all of the memory allocated in
> the eden space and a little in the first survivor (no major/full GCs), and
> total GC time was very minimal.
>
> I'm in the middle of running some more tests and will report back when I
> get a chance today or tomorrow. Thanks for all the feedback on this!
>
> [rest of quoted thread snipped]
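Clojure's volatile!/vswap! (added in 1.7) drop the CAS loop entirely: a vswap! is just a read, a function application, and a volatile store. A rough Java analogue of the difference, with names of my own choosing:

```java
import java.util.concurrent.atomic.AtomicReference;

public class AtomVsVolatile {
    // atom-style: CAS loop -- atomic, but may spin/retry under contention
    static final AtomicReference<Long> atom = new AtomicReference<>(0L);

    static long swapAdd(long delta) {
        while (true) {
            Long oldVal = atom.get();
            Long newVal = oldVal + delta;
            if (atom.compareAndSet(oldVal, newVal)) return newVal;
        }
    }

    // volatile!-style: plain volatile store -- cheaper, but the
    // read-modify-write is NOT atomic, so concurrent updates can be lost
    static volatile long vol = 0L;

    static long vswapAdd(long delta) {
        long newVal = vol + delta;
        vol = newVal;
        return newVal;
    }
}
```

The lost-update caveat doesn't matter for f2, where each thread owns its own unshared state, which is why vswap! is a safe drop-in replacement there.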
Re: Poor parallelization performance across 18 cores (but not 4)
OK, have a few updates to report:

- Oracle vs OpenJDK did not make a difference.
- Whenever I run N>1 threads calling any of these functions with swap/vswap, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N increases.
- For both swap and vswap, the function timings from running 18 futures (from one JVM) show about 1.5X the time of running 18 separate JVM processes.
- For the swap version (f2), very often a few of the calls would go rogue and take around 3X the time of the others.
  - This did not happen for the vswap version of f2.
- Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1.
- Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1, but there were a couple of those rogue threads that took 2-3X the time of the others.

Any ideas?

On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
> No worries. Thanks, I'll give that a try as well!
>
> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>> Oh, then I completely misunderstood the problem at hand here. If that's
>> the case then do the following:
>>
>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes
>> anything.
>>
>> Timothy
>>
>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba wrote:
>>> Timothy: Each thread (call of f2) creates its own "local" atom, so I
>>> don't think there should be any swap retries.
>>>
>>> Gianluca: Good idea! I've only tried OpenJDK, but I will look into
>>> trying Oracle and report back.
>>>
>>> Andy: jvisualvm was showing pretty much all of the memory allocated in
>>> the eden space and a little in the first survivor (no major/full GCs), and
>>> total GC time was very minimal.
>>>
>>> I'm in the middle of running some more tests and will report back when I
>>> get a chance today or tomorrow. Thanks for all the feedback on this!
>>>
>>> [rest of quoted thread snipped]
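David's experiment shape (per-thread timings plus an overall timing, so "rogue" 2-3X threads show up individually) can be mirrored directly on the JVM. A sketch in plain Java with a trivial stand-in task; class name, thread count, and iteration count are all my own illustrative choices:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThreadScaling {
    // Stand-in workload: repeated map update, analogous to (swap! x* assoc :k i)
    static long task(int iters) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < iters; i++) {
            m.put("k", i);
        }
        return m.get("k");
    }

    public static void main(String[] args) throws Exception {
        int nThreads = 4, iters = 1_000_000;
        List<Thread> threads = new ArrayList<>();
        long batchStart = System.nanoTime();
        for (int t = 0; t < nThreads; t++) {
            Thread th = new Thread(() -> {
                long start = System.nanoTime();
                task(iters);
                long ms = (System.nanoTime() - start) / 1_000_000;
                // per-thread time, like the inner (time (f))
                System.out.println(Thread.currentThread().getName() + ": " + ms + " ms");
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) th.join();
        // overall time, like the outer (time ...)
        System.out.println("total: " + (System.nanoTime() - batchStart) / 1_000_000 + " ms");
    }
}
```

Stepping nThreads from 1 up to the core count and watching the per-thread times diverge is the same sweep Niels describes in his claypoole harness.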
Re: Poor parallelization performance across 18 cores (but not 4)
Could you also show how you are running these functions in parallel and time them? The way you start the functions can have as much impact as the functions themselves.

Regards,
Niels

On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
> I have functions f1 and f2 below, and let's say they run in T1 and T2
> amount of time when running a single instance/thread. The issue I'm facing
> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and
> for more complex funcs takes absurdly long.
>
> (defn f1 []
>   (apply + (range 2e9)))
>
> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
> ;; should never retry.
> (defn f2 []
>   (let [x* (atom {})]
>     (loop [i 1e9]
>       (when-not (zero? i)
>         (swap! x* assoc :k i)
>         (recur (dec i))))))
>
> Of note:
> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2
> for 4 runs in parallel)
> - running 18 f1's in parallel on the 18-core machine also parallelizes
> well.
> - Disabling hyperthreading doesn't help.
> - Based on jvisualvm monitoring, doesn't seem to be GC-related
> - also tried on dedicated 18-core ec2 instance with same issues, so not
> shared-tenancy-related
> - if I make a jar that runs a single f2 and launch 18 in parallel, it
> parallelizes well (so I don't think it's machine/aws-related)
>
> Could it be that the 18 f2's in parallel on a single JVM instance is
> overworking the STM with all the swap's? Any other theories?
>
> Thanks!

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
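For intuition about why f1 scales and f2 doesn't, note that the two loops differ mainly in allocation: f1 stays in registers, while every (swap! x* assoc :k i) builds a fresh persistent map. An illustrative (not literal) Java translation of the pair, using a copied HashMap to stand in for assoc:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class F1F2 {
    // f1-style: pure arithmetic, no allocation in the loop body
    static long f1(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i;
        return sum;
    }

    // f2-style: a thread-local "atom" updated via uncontended CAS; the cost
    // is not retries (there are none) but one fresh map per iteration
    static int f2(int n) {
        AtomicReference<Map<String, Integer>> x = new AtomicReference<>(new HashMap<>());
        for (int i = n; i > 0; i--) {  // counts down, like (recur (dec i))
            final int v = i;
            x.updateAndGet(m -> {
                Map<String, Integer> copy = new HashMap<>(m);  // new map, like assoc
                copy.put("k", v);
                return copy;
            });
        }
        return x.get().get("k");
    }
}
```

Eighteen threads running the f2 shape all hammer the allocator at once even though no two ever touch the same reference, which is why the single-JVM vs many-process distinction matters.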
Re: Poor parallelization performance across 18 cores (but not 4)
Andy: Interesting. Thanks for educating me on the fact that atom swaps don't use the STM. Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, not sure how illuminating the results will be.

Niels: along the lines of this (so that each thread prints its time as well as printing the overall time):

(time
  (let [f f1
        n-runs 18
        futs (do (for [i (range n-runs)]
                   (future (time (f)))))]
    (doseq [fut futs]
      @fut)))

On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren wrote:
> Could you also show how you are running these functions in parallel and
> time them? The way you start the functions can have as much impact as the
> functions themselves.
>
> Regards,
> Niels
>
> [original post quoted in full; snipped]
Re: Poor parallelization performance across 18 cores (but not 4)
correction: that "do" should be a "doall". (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.)

On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
> Andy: Interesting. Thanks for educating me on the fact that atom swaps
> don't use the STM. Your theory seems plausible... I will try those tests
> next time I launch the 18-core instance, but yeah, not sure how
> illuminating the results will be.
>
> Niels: along the lines of this (so that each thread prints its time as
> well as printing the overall time):
>
> (time
>   (let [f f1
>         n-runs 18
>         futs (do (for [i (range n-runs)]
>                    (future (time (f)))))]
>     (doseq [fut futs]
>       @fut)))
>
> [rest of quoted thread snipped]
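The do/doall distinction matters because Clojure's for is lazy: under a plain do, the futures aren't created until something realizes the sequence, so the doseq ends up launching and awaiting them one at a time instead of in parallel. Java streams have the same trap, which makes a quick self-contained illustration (class and field names are mine):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    static int started = 0;  // counts how many times the mapped work has run

    public static void main(String[] args) {
        // Like (for ...) without doall: the intermediate map op runs nothing yet
        Stream<Integer> s = Stream.of(1, 2, 3).map(i -> { started++; return i * 2; });
        System.out.println(started);  // 0 -- no work has started

        // Only a terminal operation forces the work, like doall / doseq-deref
        List<Integer> done = s.collect(Collectors.toList());
        System.out.println(started);  // 3
        System.out.println(done);     // [2, 4, 6]
    }
}
```

As David notes, his futures printed on startup, so in his real harness the laziness was already being forced; the correction just makes the posted snippet match.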
Re: Poor parallelization performance across 18 cores (but not 4)
David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related".

What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process?

Even memory allocation could become a point of contention, depending upon how the memory allocation works with many threads. E.g., it depends on whether a thread grabs a large chunk of memory under a global lock and then locally carves it up into the small pieces it needs for each individual Java 'new' allocation, or takes a global lock for every 'new'. The latter would give terrible performance as # cores increases, but I don't know how to tell whether that is the case, except by knowing more about how the memory allocator is implemented in your JVM. Maybe digging through OpenJDK source code in the right place would tell?

Andy

On Tue, Nov 17, 2015 at 2:00 AM, David Iba wrote:
> correction: that "do" should be a "doall". (My actual test code was a bit
> different, but each run printed some info when it started so it doesn't
> have to do with delayed evaluation of lazy seq's or anything).
>
> [rest of quoted thread snipped]
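The "large chunk under a global lock, carved up locally" scheme Andy describes is exactly what HotSpot's thread-local allocation buffers (TLABs) do: each thread bump-allocates inside its own buffer and only hits the shared Eden pointer to refill it. A few HotSpot flags (JDK 8 era) that can probe this hypothesis; `bench.jar` is a placeholder for your own benchmark jar:

```shell
# Print per-thread TLAB statistics (refills, waste) at each GC
java -XX:+PrintTLAB -jar bench.jar

# Disable TLABs entirely, forcing every allocation through the shared
# Eden pointer; if 18 threads then degrade much more than 1 thread does,
# allocation-path contention is a plausible culprit
java -XX:-UseTLAB -jar bench.jar

# Pin a fixed, larger TLAB instead of letting HotSpot resize it
java -XX:TLABSize=1m -XX:-ResizeTLAB -jar bench.jar
```

(Note that MinTLABSize, mentioned elsewhere in the thread, takes a value, e.g. -XX:MinTLABSize=2k; it is not a +/- boolean toggle.)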
Re: Poor parallelization performance across 18 cores (but not 4)
There is no STM involved if you only have atoms and no refs, so it can't be STM-related.

I have a conjecture, but don't yet have a suggestion for an experiment that would prove or disprove it. The JVM memory model requires that changes to values that should be visible to all threads, like swap! on an atom, actually be made visible to all threads, which I think is often implemented by flushing local cache values to main memory, even if no other thread ever reads the value. Your f1 code only does thread-local computation, with no requirement to make its results visible to other threads. Your f2 code must make its results visible to other threads. Not only that, but the values it must make visible allocate new memory with each new value (via the calls to assoc).

Perhaps main memory is not fast enough to keep up with 18 cores running f2 at full rate, but is fast enough to keep up with 4 cores running f2 at full rate? Maybe collecting time-to-completion data for every number of cores running f2, from 4 up to 18, on the same hardware would be illuminating? Especially if it showed that there is some maximum total number of 'f2 iterations per second' that is equal across any number of cores running f2 in parallel.

I am not sure whether that would explain your result that 18 separate processes, each running 1 thread of f2, get full speedup, unless the JVM can tell only one thread is running and thus no flushes to main memory are required. Maybe try running 9 processes, each with 2 f2 threads, to see if it is as bad as 1 process with 18 threads?

Andy

On Mon, Nov 16, 2015 at 9:01 PM, David Iba wrote:
> I have functions f1 and f2 below, and let's say they run in T1 and T2
> amount of time when running a single instance/thread. The issue I'm facing
> is that parallelizing f2 across 18 cores takes anywhere from 2-5X T2, and
> for more complex funcs takes absurdly long.
>
> [rest of original post snipped]
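Andy's conjecture can at least be made concrete: on the JVM, swap! ends in a CAS/volatile write that the hardware must make visible to other cores, while a purely thread-local loop can live in registers. A minimal Java sketch of the two loop shapes (a real measurement would need JMH-style care, so this only shows the structure):

```java
// The two kinds of loop in the conjecture: a thread-local computation
// vs. one whose every result is published through a volatile store,
// the way each swap! on an atom publishes its result.
public class VisibilityCost {
    static volatile long shared;  // every write must become visible to all threads

    // f1-style: no visibility obligation; the JIT may keep x in a register
    static long plainLoop(int n) {
        long x = 0;
        for (int i = 0; i < n; i++) x += i;
        return x;
    }

    // f2-style: each iteration performs a release store that the memory
    // system must propagate, capping how many cores it can feed at full rate
    static long volatileLoop(int n) {
        for (int i = 0; i < n; i++) shared = i;
        return shared;
    }
}
```

Timing plainLoop vs volatileLoop across a rising thread count is one way to run the "f2 iterations per second ceiling" experiment Andy proposes.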