Re: Parallelising over a lazy sequence - request for help
Alan, apologies for the delayed reply - I remember Iota well (there was some cross-fertilisation between it and foldable-seq a few months back, IIRC :-) Having said that, I don't think that Iota will help in my particular situation (although I'd be delighted to be proven wrong). Given that the file I'm processing is an XML file, and will therefore have to pass through an XML parser, unless I write an XML parser on top of the reducers framework I'm stuck with dealing with sequences at some point along the way.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

On 30 Sep 2013, at 11:15, Alan Busby thebu...@gmail.com wrote:

Sorry to jump in, but I thought it worthwhile to add a couple of points (sorry for being brief):

1. Reducers work fine with data much larger than memory; you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential, not random, so spinning disks work great here.

2. A 40GB XML file is very often many, many smaller XML documents aggregated together. It's often faster to separate each document onto its own line (via various UNIX tools) and parse each line separately. I typically do something like: $ zcat bigxml.gz | tr '\n' ' ' | sed 's/foo/\nfoo/' | grep '^foo' > records.xml

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often use it for reducing over hundreds of GBs' worth of text data. It does what Jozef suggests, and makes a text file a foldable collection.

4. While pmap is great for advertising the power of Clojure, it's likely safe to say that it should be ignored if you're actually looking for performance.

Hope this helps, Alan Busby

-- -- You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
Sorry to jump in, but I thought it worthwhile to add a couple of points (sorry for being brief):

1. Reducers work fine with data much larger than memory; you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential, not random, so spinning disks work great here.

2. A 40GB XML file is very often many, many smaller XML documents aggregated together. It's often faster to separate each document onto its own line (via various UNIX tools) and parse each line separately. I typically do something like: $ zcat bigxml.gz | tr '\n' ' ' | sed 's/foo/\nfoo/' | grep '^foo' > records.xml

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often use it for reducing over hundreds of GBs' worth of text data. It does what Jozef suggests, and makes a text file a foldable collection.

4. While pmap is great for advertising the power of Clojure, it's likely safe to say that it should be ignored if you're actually looking for performance.

Hope this helps, Alan Busby
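Alan's Iota suggestion (point 3) amounts to folding directly over a file-backed collection. Below is a minimal word-count fold using only clojure.core.reducers over an in-memory vector of lines; with Iota, something like (iota/vec filename) would, if I recall its API correctly, stand in for the vector when the file is larger than memory. The count-words name and sample lines are illustrative, not from the thread:

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Minimal word-count fold over a vector of lines. r/fold splits the
;; vector into chunks for fork/join, reduces each chunk into a frequency
;; map, and merges the per-chunk maps with (merge-with +).
(defn count-words [lines]
  (r/fold
    (partial merge-with +)                 ; combine per-chunk maps
    (fn [counts line]
      (reduce (fn [m w] (update m w (fnil inc 0)))
              counts
              (str/split line #"\s+")))
    lines))

(count-words ["to be or" "not to be"])
;; => a map equal to {"to" 2, "be" 2, "or" 1, "not" 1}
```

Because the combining function is associative, the same code runs sequentially on small inputs and in parallel on large foldable ones.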
Re: Parallelising over a lazy sequence - request for help
To be clear, I don't object to the approach, only to naming it fold and/or tying it to interfaces related to folding. Stu On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher p...@paulbutcher.com wrote: On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote: Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks. Yes - I'm aware that that's how Stu's solution works. However when I use that in my example, the performance sucks. As I said in my earlier message, I believe that the reason for this is the additional copying that it involves. To date, I have exactly one implementation that has good parallel performance - the one that uses foldable-seq (which is 3x faster than the sequential version). Everything else I've tried performs worse than the sequential version. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. That's basically *exactly* what the foldable-seq implementation does. I would also mention that using transients as you build input collections will be a big win. 
I'm sure that that's true - but the foldable-seq implementation is the one that would gain most from that, and that's the one that already runs 3x faster than the sequential implementation. But it's also the one that Stu objects to. What I'm trying to find is an implementation that Stu doesn't object to, but is faster than the sequential version of the algorithm.
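The pool-of-compute-threads approach Alex describes can be illustrated with Java executors from Clojure. This is a minimal sketch under assumed names (process-in-pool, reduce-batch); the bounded queue Alex mentions for backpressure between the reader and the pool is omitted for brevity:

```clojure
(import '(java.util.concurrent Executors))

;; Minimal sketch of the pool-of-compute-threads idea: batches are handed
;; to a fixed thread pool, each task reduces one batch, and the partial
;; results are merged at the end. (A bounded queue between the reader and
;; the pool would add backpressure; it is omitted here.)
(defn process-in-pool [batches reduce-batch merge-results nthreads]
  (let [pool    (Executors/newFixedThreadPool nthreads)
        tasks   (map (fn [batch] (fn [] (reduce-batch batch))) batches)
        results (mapv #(.get ^java.util.concurrent.Future %)
                      (.invokeAll pool tasks))]
    (.shutdown pool)
    (reduce merge-results results)))

;; Toy usage - summing batches in parallel on 2 threads:
(process-in-pool [[1 2] [3 4] [5]] #(reduce + %) + 2)
;; => 15
```

Getting the batch size right is exactly the "chunkiness" trade-off discussed above: too small and coordination dominates, too large and the pool starves.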
Re: Parallelising over a lazy sequence - request for help
Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data? On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote: The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure.
Re: Parallelising over a lazy sequence - request for help
On 29 Sep 2013, at 22:58, Paul Mooser taron...@gmail.com wrote: Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data?

The dataset I'm using is a Wikipedia dump, which hardly counts as small :-) Having said that, the first couple of million lines is all you need to reproduce the results I'm getting, which you can download with: curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bunzip2 | head -n 200 > enwiki-short.xml
Re: Parallelising over a lazy sequence - request for help
Thanks - when I said small, I was referring to the fact that your tests were using the first 10,000 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: The dataset I'm using is a Wikipedia dump, which hardly counts as small :-)
Re: Parallelising over a lazy sequence - request for help
On the other hand, it is 2013, not 2003. 40GB is small in terms of modern hardware. Terabyte-RAM servers have been available for a while, at prices within the reach of many projects. Large data in this decade is measured in petabytes, at least. On Sunday, September 29, 2013 5:13:14 PM UTC-7, Paul Mooser wrote: Thanks - when I said small, I was referring to the fact that your tests were using the first 10,000 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: The dataset I'm using is a Wikipedia dump, which hardly counts as small :-)
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote: I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf.

Hey Stuart, Thanks for getting back to me. I've updated the code for my word count example on GitHub with (I believe) something that works along the lines you suggest: https://github.com/paulbutcher/parallel-word-count

Here are some sample runs on my machine (a 4-core Retina MacBook Pro). Each of these runs counts the words in the first 10,000 pages of Wikipedia:

$ lein run 1 ~/enwiki.xml sequential
Elapsed time: 23630.443 msecs
$ lein run 1 ~/enwiki.xml fold
Elapsed time: 8697.79 msecs
$ lein run 1 ~/enwiki.xml pmap 1
Elapsed time: 27393.703 msecs
$ lein run 1 ~/enwiki.xml pthenf 1
Elapsed time: 37263.193 msecs

As you can see, the foldable-seq version gives an almost 3x speedup relative to the sequential version, and both the partition-then-pmap and partition-then-fold versions are significantly slower. The last argument for the partition-then-pmap and partition-then-fold versions is a partition size. I've tried various different sizes with no material effect:

$ lein run 1 ~/enwiki.xml pthenf 1000
Elapsed time: 43187.385 msecs
$ lein run 1 ~/enwiki.xml pthenf 10
Elapsed time: 35702.651 msecs
$ lein run 1 ~/enwiki.xml pmap 1000
Elapsed time: 34087.314 msecs
$ lein run 1 ~/enwiki.xml pmap 10
Elapsed time: 47340.969 msecs

The performance of the partition-then-pmap version is actually much worse than the numbers above suggest. There's something very weird going on with (I guess) garbage collection - it takes a *long* time to finish after printing the elapsed time and the performance is completely pathological with larger page counts.
Bottom line: the only way I've been able to obtain any kind of speedup remains foldable-seq. I'd be very grateful indeed if you could take a look at how I've implemented partition-then-fold to make sure that I've correctly captured your intent. Or if you have any suggestions for anything else that might work, or to explain the poor performance of partition-then-pmap and partition-then-fold.

My guess is that the problem with partition-then-fold is the copying that's going on during the (into [] %). I can see that it is operating in parallel because the number of cores in use goes up, but the net effect is an overall slowdown rather than a speedup. That it performs worse than foldable-seq isn't surprising to me, given that it introduces an unnecessary copy.

I still think that it's a crying shame to disallow folding over sequences - as the above shows, the gains both in performance and programming ease are significant, and it would only take a small modification to the reducers API to fix the holding-onto-head problem. What would be the downside of making this modification and allowing foldable sequences?

On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote: Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Enjoying your book!
Stu On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure. The code I'm working with is checked into GitHub: The basic sequential algorithm is: (frequencies (mapcat get-words pages)) If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on my MacBook Pro. One way to parallelise it is to create a parallel version of frequencies that uses reducers: (defn frequencies-fold [words] (r/fold (partial merge-with +) (fn
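The quoted message breaks off mid-definition of frequencies-fold. A plausible completion, consistent with the (partial merge-with +) combining function shown and with the sequential (frequencies (mapcat get-words pages)) version above, might look like this (the body of the reducing function is an assumption, not Paul's actual code):

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical completion of the truncated frequencies-fold: each
;; fork/join leaf builds a frequency map word by word, and leaves are
;; merged with (merge-with +). `words` must be a foldable collection
;; (e.g. a vector) for the fold to run in parallel.
(defn frequencies-fold [words]
  (r/fold (partial merge-with +)
          (fn [counts word]
            (assoc counts word (inc (get counts word 0))))
          words))

(frequencies-fold ["to" "be" "or" "not" "to" "be"])
;; => a map equal to {"to" 2, "be" 2, "or" 1, "not" 1}
```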
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 01:22, Rich Morin r...@cfcl.com wrote: On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ... Ordered; PDF just arrived (:-).

Cool - very interested to hear your feedback once you've had a chance to read it.

I don't know yet whether the book has anything like this, but I'd like to see a table that shows which concurrency and parallelism approaches are supported (and to what extent) by various languages. So, actors, agents, queues, reducers (etc.) would be on one axis; Clojure, Erlang, Go, and Ruby (etc.) would be on the other. I'd expect Clojure to have pretty complete coverage; Ruby, not so much.

That might make a good appendix or section in the (as yet unwritten) wrap-up chapter - thanks for the suggestion. And yes, I suspect that you're right that Clojure will fare well in the comparison :-)

-r -- http://www.cfcl.com/rdm Rich Morin http://www.cfcl.com/rdm/resume r...@cfcl.com http://www.cfcl.com/rdm/weblog +1 650-873-7841 Software system design, development, and documentation
Re: Parallelising over a lazy sequence - request for help
I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go away. http://clojuredocs.org/clojure_core/clojure.core/future

Andy

Sent from my iPhone
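Andy's (shutdown-agents) tip can be sketched in a few lines; the -main wrapper and the toy pmap workload are illustrative, not from the thread:

```clojure
;; pmap runs its tasks on future threads; those pooled threads keep the
;; JVM alive for up to ~60s after -main returns. Calling shutdown-agents
;; on the way out lets the process exit promptly.
(defn -main [& args]
  (try
    (println (reduce + (pmap inc (range 100))))   ; toy parallel workload
    (finally
      (shutdown-agents))))                        ; allow prompt JVM exit
```

Running (-main) prints 5050 and the JVM exits immediately instead of idling for the agent pool's keep-alive timeout.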
Re: Parallelising over a lazy sequence - request for help
I would go a bit more further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding into the head is being worked on right now, see http://dev.clojure.org/jira/browse/CLJ-1250 JW On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote: On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.comjavascript: wrote: I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Hey Stuart, Thanks for getting back to me. I've updated the code for my word count example on GitHub with (I believe) something that works along the lines you suggest: https://github.com/paulbutcher/parallel-word-count Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each of these runs counts the words in the first 1 pages of Wikipedia: $ lein run 1 ~/enwiki.xml sequential Elapsed time: 23630.443 msecs $ lein run 1 ~/enwiki.xml fold Elapsed time: 8697.79 msecs $ lein run 1 ~/enwiki.xml pmap 1 Elapsed time: 27393.703 msecs $ lein run 1 ~/enwiki.xml pthenf 1 Elapsed time: 37263.193 msecs As you can see, the the foldable-seq version gives an almost 3x speedup relative to the sequential version, and both the partition-then-pmap and partition-then-fold versions are significantly slower. The last argument for the partition-then-pmap and partition-then-fold versions is a partition size. 
I've tried various different sizes with no material effect: $ lein run 1 ~/enwiki.xml pthenf 1000 Elapsed time: 43187.385 msecs $ lein run 1 ~/enwiki.xml pthenf 10 Elapsed time: 35702.651 msecs $ lein run 1 ~/enwiki.xml pmap 1000 Elapsed time: 34087.314 msecs $ lein run 1 ~/enwiki.xml pmap 10 Elapsed time: 47340.969 msecs The performance of the partition-then-pmap version is actually much worse than the numbers above suggest. There's something very weird going on with (I guess) garbage collection - it takes a *long* time to finish after printing the elapsed time and the performance is completely pathological with larger page counts. Bottom line: the only way I've been able to obtain any kind of speedup remains foldable-seq. I'd be very grateful indeed if you could take a look at how I've implemented partition-then-fold to make sure that I've correctly captured your intent. Or if you have any suggestions for anything else that might work, or to explain the poor performance of partition-then-pmap and partition-then-fold. My guess is that the problem with partition-then-fold is the copying that's going on during the (into [] %). I can see that it is operating in parallel because the number of cores in use goes up, but the net effect is an overall slowdown rather than a speedup. That it performs worse than foldable-seq isn't surprising to me, given that it introduces an unnecessary copy. I still think that it's a crying shame to disallow folding over sequences - as the above shows, the gains both in performance and programming ease are significant, and it would only take a small modification to the reducers API to fix the holding-onto-head problem. What would be the downside of making this modification and allowing foldable sequences? -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? 
http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: pa...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote: Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Enjoying your book! Stu On Sat, May 25, 2013 at 12:34 PM, Paul Butcher pa...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java
Re: Parallelising over a lazy sequence - request for help
Ah - one mystery down. Thanks Andy! -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher On 28 Sep 2013, at 16:26, Andy Fingerhut andy.finger...@gmail.com wrote: I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go away. http://clojuredocs.org/clojure_core/clojure.core/future Andy Sent from my iPhone
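A minimal shape of what Andy describes (the workload here is a stand-in for illustration, not the actual word-count code):

```clojure
;; Sketch: pmap and future use an agent-backed thread pool, and the JVM
;; waits ~60 seconds for that pool's idle threads to expire before the
;; process can exit. Calling shutdown-agents once the work is done
;; removes that wait.
(defn -main [& args]
  (time
    (doall (pmap inc (range 1000))))  ; stand-in for the real workload
  (shutdown-agents))
```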
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote: I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The issue with holding onto the head is being worked on right now; see http://dev.clojure.org/jira/browse/CLJ-1250 That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time. -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher -- -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
I mean that you should forget about lazy sequences, and sequences in general, if you want cutting-edge performance with reducers. An example of a reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto the head. JW On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher p...@paulbutcher.com wrote: That's fantastic news Jozef - is there any idea when that might be fixed? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time.
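Jozef's gist is linked above; as a rough illustration of the same idea (hypothetical code, not the gist's contents), a collection of lines can implement CollReduce directly over a reader, so no lazy sequence is built and no head is retained:

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

;; Hypothetical sketch: reduce directly over the lines of a file.
;; The reader is opened per reduction and closed when it finishes,
;; so files larger than memory can be processed.
(defn reducible-lines [file]
  (reify p/CollReduce
    (coll-reduce [this f]
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader file)]
        (loop [acc init]
          (if (reduced? acc)
            @acc
            (if-let [line (.readLine rdr)]
              (recur (f acc line))
              acc)))))))

;; e.g. count the lines of a file without ever holding them all:
;; (reduce (fn [n _] (inc n)) 0 (reducible-lines "big.txt"))
```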
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 17:42, Jozef Wagner jozef.wag...@gmail.com wrote: I mean that you should forget about lazy sequences, and sequences in general, if you want cutting-edge performance with reducers. An example of a reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto the head. OK - I buy that logic. But I'm unsure whether it can be applied to the case that I'm interested in. Firstly, I'm dealing with an XML file, so I need to parse it. Right now I'm using data.xml/parse to do so, which returns a lazy sequence. So, short of writing my own XML parser, I have to go via a lazy sequence at some point along the way? Secondly, the primary point of this is to exploit parallelism, not reducers per se. So supporting CollReduce isn't enough; I also need to support CollFold. Which is exactly what foldable-seq does - I'm not sure how that differs from what you're proposing? Am I missing something? -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
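For reference on the CollReduce/CollFold distinction discussed above: r/fold only parallelises collections that support CollFold (vectors and hash maps out of the box), and needs an associative combining function. A small word-count fold over an in-memory vector might look like this (illustrative only, not the foldable-seq code):

```clojure
(require '[clojure.core.reducers :as r])

;; Parallel word count over a vector. The reduce fn builds a per-chunk
;; count map; r/monoid gives fold a combinef whose zero-arity returns an
;; empty map and which merges two chunk results with merge-with +.
(defn fold-count [words]
  (r/fold (r/monoid (partial merge-with +) hash-map)
          (fn [counts w] (assoc counts w (inc (get counts w 0))))
          words))

;; (fold-count ["a" "b" "a"]) ;=> {"a" 2, "b" 1}
```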
Re: Parallelising over a lazy sequence - request for help
Well, it should be possible to implement a foldseq variant which takes a reducible collection as an input. This would speed things up, as you don't create so much garbage with reducers. An XML parser which produces a reducible collection will be a bit harder :). Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. JW
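A sketch of the transient-based counting Jozef suggests (hypothetical code, not the repo's actual function):

```clojure
;; Accumulate counts in a single transient map, mutated in place via
;; assoc!, and make it persistent exactly once at the end - instead of
;; allocating a fresh persistent map for every word seen.
(defn count-words [words]
  (persistent!
    (reduce (fn [counts w]
              (assoc! counts w (inc (get counts w 0))))
            (transient {})
            words)))

;; (= {"to" 2, "be" 2, "or" 1, "not" 1}
;;    (count-words ["to" "be" "or" "not" "to" "be"])) ;=> true
```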
Re: Parallelising over a lazy sequence - request for help
Or even better, use Guava's Multiset there... On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote: Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. JW
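The Multiset suggestion might look like this (illustrative only; assumes com.google.guava/guava is on the classpath):

```clojure
(import 'com.google.common.collect.HashMultiset)

;; A mutable counting container: .add bumps a word's count in place,
;; avoiding any per-word map allocation. Converted back to a Clojure
;; map at the end for the caller's convenience.
(defn count-words-guava [words]
  (let [ms (HashMultiset/create)]
    (doseq [^String w words]
      (.add ms w))
    (into {} (for [w (.elementSet ms)]
               [w (.count ms w)]))))
```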
Re: Parallelising over a lazy sequence - request for help
If a Clojure ticket is triaged, it means that one of the Clojure screeners believes the ticket describes a real issue with Clojure that ought to be changed in some way, and would like Rich Hickey to look at it and see whether he agrees. If he does, it becomes vetted. A diagram of the process that Clojure tickets go through is shown at the link below (you will have to scroll down a bit to see it): http://dev.clojure.org/display/community/JIRA+workflow If you are curious to know about changes in any particular ticket's state as they occur, create an account on JIRA [1] and click the Watch link when viewing that ticket. You will be emailed about changes in its state. The soonest that any ticket might become part of a modified version of Clojure released as a JAR file is when Clojure 1.6.0 is released. If forced to guess, I would put a small bet on that happening in the first half of 2014 some time, but I have no actual knowledge from those deciding when that will happen. In the past it seems to have been based more upon when certain features were added than on a particular amount of time passing. Until the release, there is no promise that any particular ticket will or will not become part of the release. [1] http://dev.clojure.org/jira/secure/Signup!default.jspa Andy On Sat, Sep 28, 2013 at 9:24 AM, Paul Butcher p...@paulbutcher.com wrote: That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process?
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 19:51, Jozef Wagner jozef.wag...@gmail.com wrote: Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. I would love it if that were true. However, the code you reference is from the algorithm that uses foldable-seq and runs quickly (i.e. not the one that has the performance problem). -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
Re: Parallelising over a lazy sequence - request for help
For your timings, I would also strongly recommend altering your project.clj to force the -server HotSpot VM: :jvm-opts ^:replace ["-Xmx1g" "-server" ...and whatever else you want here...] By default lein will use tiered compilation to optimize REPL startup, which is not what you want for timing. And I second Andy's (shutdown-agents) advice.
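Concretely, the project.clj fragment Stu describes would look something like this (project name and memory setting are illustrative):

```clojure
(defproject wordcount "0.1.0-SNAPSHOT"
  ;; ^:replace stops Leiningen from appending its default flags
  ;; (-XX:+TieredCompilation -XX:TieredStopAtLevel=1), which favour
  ;; fast REPL startup over peak steady-state performance.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```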
Re: Parallelising over a lazy sequence - request for help
Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence, but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses fork/join over the top of those chunks. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. I would also mention that using transients as you build input collections will be a big win. Alex
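The pool-of-compute-threads approach Alex outlines could be sketched like this (hypothetical names; a bounded queue between the reader and the pool would add the backpressure he mentions, omitted here for brevity):

```clojure
(import '[java.util.concurrent Executors TimeUnit])

;; A fixed pool of compute threads consumes tasks drawn from the
;; (possibly lazy) input; results are reassembled at the end by
;; dereferencing the Futures in order.
(defn process-with-pool [n-threads process items]
  (let [pool (Executors/newFixedThreadPool n-threads)]
    (try
      (->> items
           (mapv (fn [item] (.submit pool ^Callable (fn [] (process item)))))
           (mapv deref))
      (finally
        (.shutdown pool)
        (.awaitTermination pool 1 TimeUnit/MINUTES)))))

;; e.g. (process-with-pool 4 frequencies [["a" "b"] ["a" "a"]])
```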
Re: Parallelising over a lazy sequence - request for help
I am hoping that this will be fixed for 1.6, but no one is actually working on it afaik. If someone wants to take it on, I would GREATLY appreciate a patch on this ticket (must be a contributor of course).

On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote:

On 28 Sep 2013, at 17:14, Jozef Wagner jozef@gmail.com wrote: I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding onto the head is being worked on right now; see http://dev.clojure.org/jira/browse/CLJ-1250

That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: pa...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

-- -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. 
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
Can't your last possible solution rather be implemented on top of an f/j pool? Is it possible to beat f/j pool performance with an ad-hoc thread pool in situations where there are thousands of tasks? JW

On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller a...@puredanger.com wrote:

Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. I would also mention that using transients as you build input collections will be a big win. Alex
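A rough sketch of the compute-pool-plus-buffered-queue shape Alex describes, using java.util.concurrent directly. All names here are illustrative, not code from the thread; the key idea is the *bounded* work queue, whose CallerRunsPolicy throttles the reader thread whenever the workers fall behind.

```clojure
(import '(java.util.concurrent ThreadPoolExecutor ThreadPoolExecutor$CallerRunsPolicy
                               ArrayBlockingQueue TimeUnit Callable))

(defn process-in-pool
  "Sketch: feed chunks of coll to a fixed pool through a bounded queue,
  apply f to each chunk on a pool thread, and merge the partial results
  with merge-fn. When the queue is full, the submitting (reader) thread
  runs the task itself, so the reader can't race ahead of the workers
  and exhaust memory."
  [nthreads queue-size chunk-size f merge-fn coll]
  (let [^ThreadPoolExecutor pool
        (ThreadPoolExecutor. nthreads nthreads 0 TimeUnit/MILLISECONDS
                             (ArrayBlockingQueue. queue-size)
                             (ThreadPoolExecutor$CallerRunsPolicy.))
        ;; doall drives submission; backpressure comes from the bounded queue
        futs (doall (map (fn [chunk] (.submit pool ^Callable (fn [] (f chunk))))
                         (partition-all chunk-size coll)))
        res  (reduce merge-fn (map #(.get %) futs))]
    (.shutdown pool)
    res))
```

Usage for the word-count case would be something like (process-in-pool 4 8 10000 frequencies (partial merge-with +) words).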
Re: Parallelising over a lazy sequence - request for help
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid of the pause at the end of the pmap solution, and the -server argument made a very slight across-the-board performance improvement. But neither of them fundamentally changes the basic result (that the implementation that uses foldable-seq gives a 3x speedup and the other implementations are slower than the sequential version).

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

On 28 Sep 2013, at 21:47, Alex Miller a...@puredanger.com wrote:

For your timings, I would also strongly recommend altering your project.clj to force the -server hotspot: :jvm-opts ^:replace [-Xmx1g -server ... and whatever else you want here ... ] By default lein will use tiered compilation to optimize repl startup, which is not what you want for timing. And I second Andy's (shutdown-agents) advice.
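For reference, a minimal project.clj sketch of the change Alex describes. The project name and dependency version are illustrative, not taken from Paul's actual project, and the JVM options are quoted as strings as Leiningen expects:

```clojure
(defproject parallel-word-count "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  ;; ^:replace stops lein merging in its default tiered-compilation
  ;; options, so timed runs use the full -server JIT.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```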
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote:

Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks.

Yes - I'm aware that that's how Stu's solution works. However, when I use that in my example, the performance sucks. As I said in my earlier message, I believe that the reason for this is the additional copying that it involves. To date, I have exactly one implementation that has good parallel performance - the one that uses foldable-seq (which is 3x faster than the sequential version). Everything else I've tried performs worse than the sequential version.

Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system.

That's basically *exactly* what the foldable-seq implementation does.

I would also mention that using transients as you build input collections will be a big win.

I'm sure that that's true - but the foldable-seq implementation is the one that would gain most from that, and that's the one that already runs 3x faster than the sequential implementation. But it's also the one that Stu objects to. What I'm trying to find is an implementation that Stu doesn't object to, but is faster than the sequential version of the algorithm.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
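To make the transients point from this exchange concrete, here is a hedged sketch of a transient-backed word counter. The name is illustrative, and this is essentially what clojure.core/frequencies already does internally; within a single thread the transient map can be mutated in place and converted to a persistent map only once, at the end.

```clojure
(defn frequencies-transient
  "Sketch: accumulate counts into a transient map (cheap in-place
  updates, single-threaded), then freeze it with persistent!."
  [words]
  (persistent!
   (reduce (fn [counts word]
             (assoc! counts word (inc (get counts word 0))))
           (transient {})
           words)))
```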
Re: Parallelising over a lazy sequence - request for help
Hi Paul,

I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf.

Enjoying your book!

Stu
Re: Parallelising over a lazy sequence - request for help
On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ...

Ordered; PDF just arrived (:-). I don't know yet whether the book has anything like this, but I'd like to see a table that shows which concurrency and parallelism approaches are supported (and to what extent) by various languages. So, actors, agents, queues, reducers (etc.) would be on one axis; Clojure, Erlang, Go, and Ruby (etc.) would be on the other. I'd expect Clojure to have pretty complete coverage; Ruby, not so much.

-r

-- Rich Morin http://www.cfcl.com/rdm http://www.cfcl.com/rdm/resume r...@cfcl.com http://www.cfcl.com/rdm/weblog +1 650-873-7841 Software system design, development, and documentation
Parallelising over a lazy sequence - request for help
I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here.

The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure. The code I'm working with is checked into GitHub:

The basic sequential algorithm is:

(frequencies (mapcat get-words pages))

If I run that on the first 10k pages in the Wikipedia dump, it takes ~21s on my MacBook Pro. One way to parallelise it is to create a parallel version of frequencies that uses reducers:

(defn frequencies-fold [words]
  (r/fold (partial merge-with +)
          (fn [counts word]
            (assoc counts word (inc (get counts word 0))))
          words))

And sure enough, if I use that, along with the foldable-seq utility I posted about here a while ago, it runs in ~8s - almost a 3x speedup, not bad given that the parallel version is unable to use transients.

Unfortunately, as they currently stand, reducers have a fatal flaw that means that, even with foldable-seq, they're basically useless with lazy sequences. Reducers always hold onto the head of the sequence they're given, so there's no way to use this approach for a complete Wikipedia dump (which runs to around 40GiB). So the other approach I've tried is to use pmap:

(defn frequencies-pmap [words]
  (reduce (partial merge-with +)
          (pmap frequencies
                (partition-all 1 words))))

But, for reasons I don't understand, this performs dreadfully - taking ~26s, i.e. significantly slower than the sequential version. I've tried playing with different partition sizes without materially affecting the result.
So, what I'm looking for is either:

a) An explanation for why the pmap-based solution performs so badly, or
b) A way to fix the holding-onto-head problem that's inherent within reducers.

With the last of these in mind, it strikes me that the problem fundamentally arises from the desire for reducers to follow the same basic API as normal code. So:

(reduce (filter ... (map ... coll)))

becomes:

(r/fold (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce and/or fold - would avoid the problem:

(r/fold (r/filter ... (r/map ...)) coll)

Anyway - I'd be very grateful for any insight into either of the above questions. Or for suggestions for an alternative approach that might be more fruitful.

Many thanks in advance,

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
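To make the proposed call shape concrete, here is a hypothetical helper - none of this exists in clojure.core.reducers; it only illustrates supplying the collection at the point of the fold, with the transformation pipeline built beforehand without reference to any particular collection:

```clojure
(require '[clojure.core.reducers :as r])

(defn fold-with
  "Hypothetical: xform maps a collection to a reducible/foldable.
  Because the pipeline is constructed without a collection, nothing
  built ahead of time holds the head; the collection is supplied last."
  [combinef reducef xform coll]
  (r/fold combinef reducef (xform coll)))

;; e.g. (fold-with (partial merge-with +) count-word
;;                 #(r/mapcat get-words %) pages)
```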