Re: gemacl: Scientific computing application written in Clojure
Interesting read Jose, thanks! It might be interesting to try a transducer on

    (defn dot-prod
      "Returns the dot product of two vectors"
      [v1 v2]
      (reduce + (map * v1 v2)))

if you can get your hands on the 1.7 alpha and have the time and inclination to do it. Transducers have been shown to be faster than chains of sequence functions, although I don't know how likely they are to beat native arrays; probably not very likely.

On Sunday, December 21, 2014 7:10:41 PM UTC+1, Jose M. Perez Sanchez wrote: [...]
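A hedged sketch of Henrik's suggestion (dot-prod-t is a hypothetical name, not from gema). One caveat: the two-collection arity of map has no transducer form, so the vectors must be paired up first, which gives back some of the gain:

```clojure
;; Transducer-based dot product (Clojure 1.7+). The two-coll arity of
;; `map` has no transducer form, so the elements are paired first.
(defn dot-prod-t
  "Returns the dot product of two vectors, via transduce."
  [v1 v2]
  (transduce (map (fn [[a b]] (* a b)))
             +
             0.0
             (map vector v1 v2)))
```

Whether this beats the plain reduce/map version would need benchmarking; the pairing step allocates an intermediate sequence of its own.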
--
You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: gemacl: Scientific computing application written in Clojure
For most array operations (e.g. dot products on vectors), I strongly recommend trying out the recent core.matrix implementations. We've put a lot of effort into fast implementations and a nice clean Clojure API, so I'd love to see them used where it makes sense! For example, vectorz-clj can be over 100x faster than a naive map / reduce implementation:

    (let [a (vec (range 1))
          b (vec (range 1))]
      (time (dotimes [i 100]
              (reduce + (map * a b)))))
    ;; Elapsed time: 364.590211 msecs

    (let [a (array :vectorz (range 1))
          b (array :vectorz (range 1))]
      (time (dotimes [i 100]
              (dot a b))))
    ;; Elapsed time: 3.358484 msecs

On Monday, 22 December 2014 17:31:41 UTC+8, Henrik Eneroth wrote: [...]
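For anyone wanting to try Mikera's suggestion in the dot-prod role, a minimal sketch (assuming vectorz-clj is on the classpath; the namespace and dot-prod-m are hypothetical names):

```clojure
(ns example.dot
  (:require [clojure.core.matrix :as m]))

;; Use the pure-JVM vectorz implementation for newly created arrays.
(m/set-current-implementation :vectorz)

(defn dot-prod-m
  "Returns the dot product of two vectors via core.matrix."
  [v1 v2]
  (m/dot (m/array v1) (m/array v2)))
```

The same code keeps working against other core.matrix implementations; only the `set-current-implementation` call would change.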
Re: gemacl: Scientific computing application written in Clojure
I'll second the use of core.matrix. It's a wonderful, idiomatic, fast library, and I hope to see folks continue to rally around it.

On Monday, December 22, 2014 3:47:59 AM UTC-7, Mikera wrote: [...]
Re: gemacl: Scientific computing application written in Clojure
Thank you very much for your replies. I will definitely take a look at core.matrix. I really hate the fact that I had to use Java arrays to make it fast. I'll take a look at transducers as well. Kind regards, Jose.

On Monday, December 22, 2014 7:09:27 PM UTC-5, Christopher Small wrote: [...]
Re: gemacl: Scientific computing application written in Clojure
Hi everyone: Sorry that it has taken so long. I've just released the software on GitHub under the EPL. It can be found at: https://github.com/iosephus/gema Kind regards, Jose.
Re: gemacl: Scientific computing application written in Clojure
Regarding the speed optimizations: execution time for a given model was reduced from 2735 seconds to 70 seconds over several versions by doing several optimizations. The same calculation implemented in C# takes 12 seconds on the same computer and OS. Maybe the Clojure code can still be improved, but for the time being I'm happy with the Clojure version being six times slower, since the new software has many advantages. For these tests the model was the circle with radius 1 using the diffmr1 tracker; the simulation was run using 1 particles and 1 total random walk steps.

These modifications in the critical parts of the code accounted for most of the improvement:

- Avoid reflection by using type hints.
- Use Java arrays.
- In some cases call Java arithmetic functions directly instead of Clojure ones.
- Avoid using partial functions in the critical parts of the code.

Avoiding laziness did not help much.

Regarding the use of Java arrays, there are many small functions performing typical vector operations on arrays, such as the following example.

Using Clojure types:

    (defn dot-prod
      "Returns the dot product of two vectors"
      [v1 v2]
      (reduce + (map * v1 v2)))

Using Java arrays:

    (defn dot-prod-j
      "Returns the dot product of two arrays of doubles"
      [^doubles v1 ^doubles v2]
      (areduce v1 i ret 0.0
               (+ ret (* (aget v1 i) (aget v2 i)))))

This gives a general idea of which optimizations helped the most. These changes are not in the public repository; previous commits have been omitted because the code was not ready for publication (different license disclaimer, contained email addresses, etc.). If anyone is interested in the diffs and the execution times over the several optimization steps, please contact me. Kind regards, Jose.

On Sunday, December 21, 2014 3:38:35 AM UTC-5, Jose M. Perez Sanchez wrote: [...]
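To illustrate the first two bullet points, a small sketch (norm-j is a hypothetical example, not taken from gema): enabling *warn-on-reflection* makes the compiler report reflective call sites, and the ^doubles hint is what lets aget and the Math/sqrt call compile to direct, non-reflective calls.

```clojure
;; Ask the compiler to warn whenever it falls back to reflection.
(set! *warn-on-reflection* true)

(defn norm-j
  "Euclidean norm of a double array; the ^doubles hint avoids reflection."
  [^doubles v]
  (Math/sqrt (areduce v i ret 0.0
                      (+ ret (* (aget v i) (aget v i))))))
```

Without the hint, each aget would go through reflection and the loop would be orders of magnitude slower.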
Re: gemacl: Scientific computing application written in Clojure
On Tuesday, June 3, 2014 12:46:55 PM UTC-5, Mars0i wrote:

    (def ones (doall (repeat 1000 1)))
    (bench (def _ (doall (map rand ones))))   ; 189 microseconds average time
    (bench (def _ (doall (pmap rand ones))))  ; 948 microseconds average time

For the record, I worried later that rand was too inexpensive, and that those results were being driven only by the cost of setting up threads in pmap. This seems like a better test:

    (bench (def _ (doall (map #(nth (iterate rand %) 1) (repeat 256 1)))))   ; 185 milliseconds average time
    (bench (def _ (doall (pmap #(nth (iterate rand %) 1) (repeat 256 1)))))  ; 793 milliseconds average time

I have been having success getting a speedup simply by changing certain map calls to pmap in my main project. I'm sure that many of us will be interested in a report whenever you get to it, but I can easily imagine that finding the time to summarize what you've learned is difficult.
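One common way to make pmap pay off for cheap functions, sketched here under the assumption that per-task overhead is what dominates (pmap-chunked is a hypothetical helper): batch the work so each parallel task carries enough computation to amortize the coordination cost.

```clojure
(defn pmap-chunked
  "Like pmap, but applies f to n-element chunks, so each parallel task
  does enough work to amortize thread-coordination overhead."
  [n f coll]
  (apply concat (pmap #(mapv f %) (partition-all n coll))))
```

For example, `(pmap-chunked 1000 some-cheap-fn data)` runs one parallel task per 1000 elements instead of per element.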
Re: gemacl: Scientific computing application written in Clojure
Jose,

This is an old thread, and whatever problems you might be dealing with now, they're probably not the same ones as when the thread was active. However, I think that if parallel code uses the built-in Clojure random number functions, there is probably a bottleneck in access to the RNG. With Criterium's bench function on an 8-core machine:

    (def ones (doall (repeat 1000 1)))
    (bench (def _ (doall (map rand ones))))   ; 189 microseconds average time
    (bench (def _ (doall (pmap rand ones))))  ; 948 microseconds average time

One solution that doesn't involve generating the numbers in advance is to create separate RNGs, as discussed in this thread: https://groups.google.com/forum/#!searchin/clojure/random/clojure/cRVS19PB06E/8FsmtsYx6SkJ. This is a strategy that I am starting to explore.

Related notes for anyone interested:

- As of Incanter 1.5.5, at least some functions such as sample are based on Clojure's built-in rand, so they would have this problem as well.
- clojure.data.generators allows rebinding the RNG, and provides reservoir-sample and a replacement for the default rand-nth.
- The bigml/sampling library (https://github.com/bigmlcom/sampling/tree/master/src/bigml/sampling) provides sampling and random number functions with optional generation of a new RNG.
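The separate-RNGs strategy can be sketched like this (make-rand-fn is a hypothetical name, and java.util.Random stands in for a Mersenne Twister, which would come from a library such as Colt):

```clojure
(import 'java.util.Random)

(defn make-rand-fn
  "Returns a zero-argument fn producing doubles in [0,1) from a private
  RNG, so parallel workers never contend on a shared generator."
  [seed]
  (let [rng (Random. (long seed))]
    (fn [] (.nextDouble rng))))

;; One generator per worker, each seeded differently:
(def worker-rands (mapv make-rand-fn (range 8)))
```

Each worker then calls its own element of worker-rands instead of clojure.core/rand, removing the shared-RNG bottleneck.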
Re: gemacl: Scientific computing application written in Clojure
Thank you very much. I'm using the Colt random number generator directly. I've managed to reduce computing time by orders of magnitude using type hints and Java arrays in some critical parts. I haven't had the time to write a report on this for the list, since I have been busy with other projects, but this will come, as well as the release of the source code. Thanks again, Jose.

On Tuesday, June 3, 2014 1:46:55 PM UTC-4, Mars0i wrote: [...]
Re: gemacl: Scientific computing application written in Clojure
Yes, the step extract function encodes the total number of steps and any intermediate steps whose values are to be saved. I made the following changes to the code:

1. Store results locally in the threads and return them when the thread function exits, instead of using a global vector. This does not impact performance directly (tested), but it allows using a transient vector to store the results locally, which is faster.
2. Use loop/recur to loop over the particles, the steps and the valid displacement generation (instead of lazy sequences with an extract function). Also in a few other small loops that are executed many times.
3. Use transients in any vector to which a lot of data is going to be conjoined during the calculation.

These changes brought the following results. There is some improvement, both in computing time and scaling. See the graphs attached: the master branch is the old code I posted already, and the perftest branch contains the changes. I'm sure there is still room for improvement, and I'll focus on that as soon as some important missing features get implemented and I can finish some calculations that we need urgently.

kovasb: Could you elaborate on the last part of "I think you should try making the core iteration purely functional, meaning no agents, atoms, or side effecting functions like the random generator"? I did remove the atom and agent (I keep a global integer ref though, since I need to track the progress of the calculation). Regarding the random displacements, if it means generating them first and then consuming them in a side-effect-free fashion, it would take a lot of RAM to store all those numbers...

Thanks a lot for the help. I'll keep you posted about any other tests that might be interesting and will let you know when the code gets released. Best, Jose.
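Changes 1 and 3 combined can be sketched as follows (walk-all is a hypothetical reduction of the idea, not gema's actual code): each worker accumulates into a transient vector and makes it persistent once, on exit.

```clojure
(defn walk-all
  "Applies step-fn n times starting from init, collecting every state
  in a transient vector that is made persistent once, at the end."
  [step-fn init n]
  (loop [i 0, state init, acc (transient [])]
    (if (< i n)
      (recur (inc i) (step-fn state) (conj! acc state))
      (persistent! acc))))
```

Because the transient never escapes the loop, this stays safe to run in one thread per worker while avoiding the per-conj allocation of a persistent vector.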
Attachments: benchmark.pdf, scaling.pdf
Re: gemacl: Scientific computing application written in Clojure
Hi Andy, cej38, kovas: Thanks for the replies. I plan to release the whole code soon (waiting for institutional authorization). I do use laziness, both within the move function to select the allowed random displacements and when iterating the move function to generate the trajectory. Lazy structures are only consumed within the thread in which they are created. Here is the core code where the computation happens:

    (defn step-particle
      "Returns a new value for particle after moving particle once to a
      new position from the current one"
      [pdf-get-fn step-condition-fn calc-value-fns particle]
      (let [pos (particle :pos)
            disp (first (filter (fn [x] (step-condition-fn (particle :pos) x))
                                (repeatedly (fn [] (pdf-get-fn)))))
            new-pos (mapv + pos disp)
            new-base-particle {:pos new-pos :steps (inc (particle :steps))}
            new-trackers-results (if (seq calc-value-fns)
                                   (zipmap (keys calc-value-fns)
                                           ((apply juxt (vals calc-value-fns))
                                            particle new-base-particle))
                                   {})]
        (merge new-trackers-results new-base-particle)))

    (defn walk-particles
      "While there is work to do, create new particles, move them n-steps,
      then send them to particle container (agent)"
      [todo particles simul-info init-get-fn init-condition step-get-fn
       step-condition trackers-maps step-extract-fn]
      (let [init-value-fns (zipmap (keys trackers-maps)
                                   (map :create-fn (vals trackers-maps)))
            calc-value-fns (zipmap (keys trackers-maps)
                                   (map :compute-fn (vals trackers-maps)))
            move (partial step-particle step-get-fn step-condition calc-value-fns)]
        (while (> @todo 0)
          (swap! todo dec)
          (let [p (last (create-particle init-get-fn init-condition init-value-fns))
                lazy-steps (iterate move p)
                result (step-extract-fn lazy-steps)]
            (send-off particles (fn [x] (conj x result)))))))

Each worker is created by launching a future that executes walk-particles; each worker has a separate Mersenne Twister random number generator embedded into the pdf-get-fn (using partial on a common pdf-get function and different MT generators).
In real calculations both the number of particles and the number of steps are at least 1e4. In the benchmarks I'm posting, particles are 1000 and steps are 5000. As expected, conjoining to a single global vector poses no problem; I tested both conjoining to a single global vector and to separate global vectors (one per worker), and the computing time is the same. I could test on another system with 16 cores. See the results attached for the 8 and 16 core systems. Best, Jose.

Attachment: benchmark.pdf
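The worker setup described above might look roughly like this (launch-workers and the argument names are hypothetical, distilled from the description rather than copied from gema):

```clojure
(defn launch-workers
  "Starts n futures, each running work-fn with its own private rand fn
  (built by make-rand-fn from the worker index, e.g. wrapping a
  per-worker Mersenne Twister)."
  [n work-fn make-rand-fn]
  (doall (map (fn [i] (future (work-fn (make-rand-fn i))))
              (range n))))

;; Wait for all workers and collect their return values:
;; (mapv deref (launch-workers 8 walk-fn make-rand-fn))
```

The doall matters: without it the futures would only be launched lazily, as the sequence is consumed.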
Re: gemacl: Scientific computing application written in Clojure
Hi Jose, I think you should try making the core iteration purely functional, meaning no agents, atoms, or side effecting functions like the random generator. I assume the number of steps you evolve the particle is encoded in step-extract-fn? What you probably want is something like:

    (loop [i 0
           pos initial-position]
      (if (< i num-of-steps)
        (recur (+ i 1) (move pos)) ;; iterate
        pos))                      ;; if done, return final position

This will make it easier to benchmark the iteration step, which is an important number to know. I'm sure you can make it much faster; if perf is the ultimate goal, it's worth tuning a little.

In terms of distributing the work, I would not use atoms or agents. They are not meant for parallelism or for work queues. With agents and futures you need to be aware of the various thread pools involved under the hood and make sure you are not saturating them. And combined with laziness, it takes care to ensure work is getting done where you are expecting it. It would be easier to reason about what is going on by using threads and queues directly: enqueue a bunch of work on a queue, and directly set up a bunch of threads that read batches of work from the queue until it's empty. If the initial condition / other parameters are the same across workers, you could even skip the work queue, and just set up a bunch of threads that do the iterations and then dump their result somewhere.

I also definitely recommend making friends with loop/recur:

    (time (loop [i 0] (if (< i 100) (recur (+ 1 i)) true)))
    ;; => Elapsed time: 2.441 msecs

    (def i (atom 0))
    (time (while (< @i 100) (swap! i + 1)))
    ;; => Elapsed time: 52.767 msecs

loop/recur is both simpler and faster, and the best way to rapidly iterate.

On Mon, Nov 18, 2013 at 7:47 PM, Jose M. Perez Sanchez m...@josemperez.com wrote: [...]
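A minimal sketch of the threads-plus-queue approach kovasb describes (run-with-queue is a hypothetical helper; results are collected in an atom only for brevity, and each thread could equally return its own batch):

```clojure
(import '[java.util.concurrent LinkedBlockingQueue])

(defn run-with-queue
  "Processes work-items with f, using n-threads plain threads pulling
  from a shared queue; returns the results (order not preserved)."
  [n-threads f work-items]
  (let [q (LinkedBlockingQueue. ^java.util.Collection (vec work-items))
        results (atom [])
        worker (fn []
                 (loop []
                   (when-some [item (.poll q)] ; nil once the queue is empty
                     (swap! results conj (f item))
                     (recur))))
        threads (doall (repeatedly n-threads
                                   #(doto (Thread. ^Runnable worker)
                                      (.start))))]
    (doseq [^Thread t threads] (.join t))
    @results))
```

For example, `(run-with-queue 8 move-particle particle-seeds)` gives exactly n-threads worker threads with no hidden thread pool in between.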
Conjoining to a single global vector poses no problem, as expected: I tested both conjoining to a single global vector and conjoining to separate global vectors (one per worker), and the computing time is the same. I was also able to test on another system with 16 cores; see the attached results for both the 8- and 16-core systems. Best, Jose. -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
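kovas's suggestion earlier in the thread of using plain threads and a queue directly, instead of agents and futures, can be sketched as follows. All names here are illustrative, and simulate-particle is a trivial stand-in for the real random-walk computation.

```clojure
(import '[java.util.concurrent LinkedBlockingQueue])

(defn simulate-particle
  "Trivial stand-in for the real per-particle random walk."
  [job]
  (* job job))

(defn run-workers
  "Drains a shared queue of jobs with n plain threads. Each thread
  conj'es one result onto the shared atom per job, not per step, so
  contention stays low."
  [n jobs]
  (let [queue   (LinkedBlockingQueue. ^java.util.Collection (vec jobs))
        results (atom [])
        workers (mapv (fn [_]
                        (Thread.
                         (fn []
                           (loop []
                             ;; .poll returns nil once the queue is empty
                             (when-let [job (.poll queue)]
                               (swap! results conj (simulate-particle job))
                               (recur))))))
                      (range n))]
    (doseq [^Thread t workers] (.start t))
    (doseq [^Thread t workers] (.join t))
    @results))
```

With identical parameters across workers, the queue could be dropped entirely and each thread could just run a fixed share of the particles.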
Re: gemacl: Scientific computing application written in Clojure
It is hard to say where the root of your problem lies without looking at the code more. I would look closely at laziness. I find that lazy evaluation really kills parallelization.
Re: gemacl: Scientific computing application written in Clojure
Sounds like some form of overhead is dominating the computation. How are the infinite sequences being consumed? Is it one thread per sequence? How compute-intensive is (move particle)? What kind of numbers are we talking about in terms of steps and particles?

If move is fast, you probably need to batch up your computation. If move is a simple arithmetic operation or otherwise something without an inner loop, I'd make it perform at least 100 iterations per invocation. If you have many particles, I'd pay attention to how the threads are switching between them, and eliminate any switching if possible. I'd definitely recommend removing the global recording to reduce complexity for now.
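The batching advice above can be made concrete with a loop/recur wrapper around the step function. This is a sketch, with move as a toy stand-in for gema's actual step function.

```clojure
(defn move
  "Toy stand-in for the real random-walk step function."
  [particle]
  (update-in particle [:pos] inc))

(defn move-batch
  "Advances the particle n steps in a tight loop, so per-invocation
  overhead (lazy-seq allocation, dispatch) is paid once per batch
  rather than once per step."
  [n particle]
  (loop [i 0, p particle]
    (if (< i n)
      (recur (inc i) (move p))
      p)))

;; (iterate (partial move-batch 100) particle) then yields one state per
;; 100 underlying steps instead of one per step.
```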
Re: gemacl: Scientific computing application written in Clojure
Hi Andy: Thanks a lot for your reply. I'll do more careful testing in the very near future, and there is surely a lot to optimize in my code. I must say I did expect a computing speed reduction, coming from an already optimized codebase with the performance-critical parts written in C; there is an intentional trade-off in my porting effort to get something more maintainable, extensible and scalable. My future plans are to run in a cluster on something like EC2, because I've done the numbers and buying hardware isn't cost-effective for us anymore (we paid around EUR 10K for our last big computer, and we can do a lot of computing in the cloud for that money). Since the software is used for research, we tend to add features and change it so that it simulates the different scenarios coming out of our scientific discussions: this means we spend almost as much time coding as simulating, and having a higher-level language like Clojure helps us enormously. I'll keep you posted about my future performance tests and the Open Source release of the software. Best, Jose.
Re: gemacl: Scientific computing application written in Clojure
Jose: On re-reading your original post, I noticed one statement you made that may be of interest: "The resulting vector for each particle is then added (conj) to a global vector for latter storage." Do you mean that there is a single global vector that is conj'd onto by all N threads? Is this vector in a ref or atom, perhaps, and you use swap! or something similar to update it from all threads? If so, and if you do that frequently from each thread, then that part of your code is definitely not embarrassingly parallel, even if the rest of it is. Andy
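One way to keep the shared vector from becoming a serialization point, in the spirit of the concern raised above: have each thread accumulate results locally and touch the shared atom only once. This is a toy sketch, not gema's code, and all names are illustrative.

```clojure
(defn record-per-thread
  "Each of n threads builds its own local vector of [thread-id op-id]
  pairs and conj'es it onto the shared atom exactly once, so cross-thread
  contention is limited to n swap! calls in total."
  [n ops-per-thread]
  (let [out (atom [])
        ts  (mapv (fn [t]
                    (Thread.
                     (fn []
                       (let [local (reduce (fn [v i] (conj v [t i]))
                                           []
                                           (range ops-per-thread))]
                         ;; one shared-state update per thread, not per op
                         (swap! out conj local)))))
                  (range n))]
    (doseq [^Thread t ts] (.start t))
    (doseq [^Thread t ts] (.join t))
    @out))
```

The contended case (every op doing swap! on the shared atom) has the same results but forces every thread through one compare-and-swap loop per operation.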
Re: gemacl: Scientific computing application written in Clojure
Hi Andy: Yes, this breaks embarrassing parallelism indeed. When the calculations are done for real this isn't a problem, though, because these conj operations on the global list happen sporadically (on average once every couple of seconds or so), so the probability of a thread waiting for a significant amount of time is very low. In the short benchmarks I posted this happens every few milliseconds on average, and there it could be a problem. Honestly, I don't expect a problem even in the one-conj-every-few-milliseconds case. I don't know how computationally expensive the conj is, but for every conj to the global list, at least a few tens of thousands of random numbers are generated with the Mersenne Twister, and a similar number of other arithmetical operations are done. Several local conj operations inside the thread are also performed, and in each of the few thousand steps maps are created and merged. The only way to know for sure is testing, though; I'll post the results as soon as I can run a test. Thanks a lot. Best, Jose.
gemacl: Scientific computing application written in Clojure
Hello everyone: This is my first post here. I'm a researcher writing numerical simulation software in Clojure. Actually, I'm porting an app a coworker and I wrote in C/Python (called GEMA) to Clojure. The app has been in use for a while at our group, but became very difficult to maintain due to outgrowing its initial design and being very monolithic; at the same time I wanted to learn functional programming, so I've been working on the port for a few weeks.

The simulations are embarrassingly parallel random walk calculations used to study gas diffusion and Helium-3 Magnetic Resonance diffusion measurements in the lungs. At the core of the simulations we do there is a 3D geometrical model of the pulmonary acinus. The new application is designed in a modular fashion; I'm including part of the current README file with a description. I've approached my institution's Technology Transfer Office to request authorization to release the software under an Open Source license, and if everything goes well the code will be published soon. I'm very happy with my Clojure trip so far and all the things I'm learning in the process.

One of the things I've observed is poor scaling with the number of threads for more than 4 threads on an 8-core Intel i7 CPU, as follows:

    NT  Time (s)  cpu%x8
     1    101.9     108
     2     54.9     220
     4     36.0     430
     6     33.9     570
     8     32.5     700
    10     32.5     720

Computing times reported are just the time spent in the computation of the NT futures (not total program execution time). The CPU x8 percent is measured with top on Linux, and the % values are approximate, just to give an idea.

I'm running on Debian Wheezy with the following Java platform: JRE: OpenJDK Runtime Environment 1.6.0_27-b27 on Linux 3.2.0-4-amd64 (amd64); JVM: OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode). I'll try on a 16-core machine (4-way Opteron) soon and see what happens there.
The computing happens over an infinite lazy sequence of random walk steps generated with (iterate move particle), where an extraction function gets values from zero to the highest number of random walk steps and adds (conj) the values to be kept to a vector. The resulting vector for each particle is then added (conj) to a global vector for later storage. I've read the previous post about concurrent performance on AMD processors: https://groups.google.com/forum/#!topic/clojure/48W2eff3caU%5B1-25-false%5D. I have to go through it again with more time, though, to check whether any of the explanations presented there apply to my application. Best regards, Jose Manuel.

README-brief.md (attachment)
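The pipeline described above, an infinite lazy sequence from iterate with an extraction function picking out the states to keep, has roughly this shape. This is a sketch with illustrative names; move is a trivial stand-in for the real step function.

```clojure
(defn move
  "Trivial stand-in for the real random-walk step function."
  [particle]
  (update-in particle [:pos] inc))

(defn extract-every
  "Realizes the first (inc n-steps) states of the lazy trajectory
  (steps 0 through n-steps) and keeps every k-th one in a vector,
  mirroring the extraction function described above."
  [k n-steps lazy-steps]
  (vec (take-nth k (take (inc n-steps) lazy-steps))))

;; keeps the states at steps 0, 2, 4, 6, 8 and 10:
(extract-every 2 10 (iterate move {:pos 0}))
```

Note that nothing past step n-steps is ever realized, so the infinite sequence costs only what is consumed.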
Re: gemacl: Scientific computing application written in Clojure
Jose: I am not aware of any conclusive explanation for the issue, and would love to know one if anyone finds out. At least in the case of the program mentioned in the other discussion thread, much better speedup was achieved running N different JVM processes, each single-threaded, on a machine with N CPU cores. If you are willing to try an experiment like that and see whether you get similar results, that would indicate that the issue is due to multiple threads within a single JVM, as opposed to some OS or hardware performance limitation. Below is a list of the possible explanations that seem most likely to me, but again, no conclusive evidence for any of them yet:

1. JVM object allocation and/or garbage collection using locks or other multi-threading performance killers.
2. CPU core cache thrashing when the thread scheduler causes threads to frequently be scheduled on different CPU cores (I haven't aired that guess before, but it is related to the guess I made near the end of the conversation you link to).
3. CPU core cache thrashing because single-threaded versions have working sets that fit in caches close to CPU cores, but this working set is multiplied by N when running N threads.
4. Some subtle area of the Clojure implementation that you are using that is limiting parallelism.

Andy