Re: Parallelising over a lazy sequence - request for help

2013-10-02 Thread Paul Butcher
Alan,

Apologies for the delayed reply - I remember Iota well (there was some 
cross-fertilisation between it and foldable-seq a few months back IIRC :-)

Having said that, I don't think that Iota will help in my particular situation 
(although I'd be delighted to be proven wrong). Given that the file I'm 
processing is XML, and will therefore have to pass through an XML parser, I'm 
stuck with dealing with sequences at some point along the way, unless I write 
an XML parser on top of the reducers framework.

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 30 Sep 2013, at 11:15, Alan Busby thebu...@gmail.com wrote:

 Sorry to jump in, but I thought it worthwhile to add a couple of points 
 (sorry for being brief):
 
 1. Reducers work fine with data much larger than memory, you just need to 
 mmap() the data you're working with so Clojure thinks everything is in memory 
 when it isn't. Reducer access is fairly sequential, not random, so spinning 
 disks work great here. 
 
 2. A 40GB XML file is very often many, many smaller XML documents aggregated 
 together. It's often faster to separate each document into its own line (via 
 various UNIX tools) and parse each line separately. I typically do something 
 like $ zcat bigxml.gz | tr '\n' ' ' | sed 's/<foo/\n<foo/g' | grep '^<foo' 
 > records.xml .
 
 3. Check out the Iota library, https://github.com/thebusby/iota/ . I often 
 use it for reducing over hundreds of GBs of text data. It does what Jozef 
 suggests, and makes a text file a foldable collection.
 
 4. While pmap is great for advertising the power of Clojure, it's likely safe 
 to say that it should be ignored if you're actually looking for performance. 
 
 
 Hope this helps,
 Alan Busby
 
 
 
 
 -- 
 -- 
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with your 
 first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en
 --- 
 You received this message because you are subscribed to the Google Groups 
 Clojure group.
 To unsubscribe from this group and stop receiving emails from it, send an 
 email to clojure+unsubscr...@googlegroups.com.
 For more options, visit https://groups.google.com/groups/opt_out.



Re: Parallelising over a lazy sequence - request for help

2013-09-30 Thread Alan Busby
Sorry to jump in, but I thought it worthwhile to add a couple of points
(sorry for being brief):

1. Reducers work fine with data much larger than memory, you just need to
mmap() the data you're working with so Clojure thinks everything is in
memory when it isn't. Reducer access is fairly sequential, not random, so
spinning disks work great here.
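For what it's worth, the mmap() approach can be sketched in a few lines via java.nio (a hypothetical helper, not Alan's code; note that a single MappedByteBuffer tops out at 2GB, so a 40GB file would have to be mapped in several windows):

```clojure
(import '(java.io RandomAccessFile)
        '(java.nio.channels FileChannel$MapMode))

;; Memory-map a file read-only. The OS pages data in on demand, so the
;; file can be scanned without loading it into the heap. A single
;; MappedByteBuffer is capped at 2GB; a 40GB file would need several
;; such windows.
(defn mmap-file [path]
  (let [ch (.getChannel (RandomAccessFile. ^String path "r"))]
    (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))
```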

2. A 40GB XML file is very often many, many smaller XML documents aggregated
together. It's often faster to separate each document into its own line
(via various UNIX tools) and parse each line separately. I typically do
something like $ zcat bigxml.gz | tr '\n' ' ' | sed 's/<foo/\n<foo/g' |
grep '^<foo' > records.xml .

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often
use it for reducing over hundreds of GBs of text data. It does what Jozef
suggests, and makes a text file a foldable collection.
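As a sketch of that pattern (written from memory of Iota's README, so treat the details as assumptions: iota/vec memory-maps the file and presents it as a foldable vector of lines, using nil for blank lines):

```clojure
(require '[clojure.core.reducers :as r]
         'iota)

;; Count the non-blank lines of an arbitrarily large file by folding
;; over a memory-mapped view of it. Needs the iota dependency on the
;; classpath.
(defn count-lines [filename]
  (->> (iota/vec filename)
       (r/filter identity)        ; drop the nils Iota uses for blank lines
       (r/map (constantly 1))
       (r/fold +)))
```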

4. While pmap is great for advertising the power of Clojure, it's likely
safe to say that it should be ignored if you're actually looking for
performance.


Hope this helps,
Alan Busby



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Stuart Halloway
To be clear, I don't object to the approach, only to naming it fold
and/or tying it to interfaces related to folding.

Stu


On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher p...@paulbutcher.com wrote:

 On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote:

 Reducers (and fork/join in general) are best suited for fine-grained
 computational parallelism on in-memory data. The problem in question
 involves processing more data than will fit in memory.

 So the question is then what is the best way to parallelize computation
 over the stream. There are many ways to do so - pmap is one easy mechanism
 to get parallelism over a lazy sequence but the chunking behavior of pmap
 can easily be non-optimal for particular use cases. Stu's other solution
 (prepare-with-partition-then-fold) reads chunks of data into memory, then
 uses f/j over the top of those chunks.


 Yes - I'm aware that that's how Stu's solution works. However when I use
 that in my example, the performance sucks. As I said in my earlier message,
 I believe that the reason for this is the additional copying that it
 involves.

 To date, I have exactly one implementation that has good parallel
 performance - the one that uses foldable-seq (which is 3x faster than the
 sequential version). Everything else I've tried performs worse than the
 sequential version.

 Yet another possible solution is to use a pool of compute threads and to
 read from the stream, give those compute tasks to the pool to execute in a
 background thread, then reassemble the results at the end (or in a separate
 thread). The balance in this approach is to make the tasks the right
 chunkiness. Using a buffered queue is one technique to avoid having your
 input reader overload the capacity of the processing system.


 That's basically *exactly* what the foldable-seq implementation does.

 I would also mention that using transients as you build input collections
 will be a big win.


 I'm sure that that's true - but the foldable-seq implementation is the one
 that would gain most from that, and that's the one that already runs 3x
 faster than the sequential implementation. But it's also the one that Stu
 objects to. What I'm trying to find is an implementation that Stu doesn't
 object to, but is faster than the sequential version of the algorithm.
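For concreteness, the pool-plus-buffered-queue shape Alex describes might be sketched like this (a hypothetical helper on java.util.concurrent, not the actual foldable-seq internals):

```clojure
(import '(java.util.concurrent ArrayBlockingQueue Executors))

;; A bounded queue between the (lazy) input and a fixed pool of workers.
;; The queue's capacity stops the reader from racing ahead of the
;; processing threads; a sentinel value tells each worker to stop.
(defn process-chunks
  [chunks worker-count process-fn combine-fn init]
  (let [queue    (ArrayBlockingQueue. 16)
        pool     (Executors/newFixedThreadPool worker-count)
        results  (atom init)
        sentinel ::done
        worker   (fn []
                   (loop []
                     (let [chunk (.take queue)]
                       (when-not (= chunk sentinel)
                         (swap! results combine-fn (process-fn chunk))
                         (recur)))))
        futures  (doall (repeatedly worker-count
                                    #(.submit pool ^Runnable worker)))]
    (doseq [c chunks] (.put queue c))            ; blocks when the queue is full
    (dotimes [_ worker-count] (.put queue sentinel))
    (doseq [f futures] (.get f))                 ; wait for the workers
    (.shutdown pool)
    @results))
```

Something like (process-chunks (partition-all 10000 words) 4 frequencies (partial merge-with +) {}) would then parallelise the count, with the chunk size tuned by hand.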



Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Paul, is there any easy way to get the (small) dataset you're working with, 
so we can run your actual code against the same data?

On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote:


 The example counts the words contained within a Wikipedia dump. It should 
 respond well to parallelisation (I have Java and Scala versions that 
 perform excellently) but I've been incapable of coming up with a nice 
 solution in Clojure.





Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Butcher
On 29 Sep 2013, at 22:58, Paul Mooser taron...@gmail.com wrote:

 Paul, is there any easy way to get the (small) dataset you're working with, 
 so we can run your actual code against the same data?

The dataset I'm using is a Wikipedia dump, which hardly counts as small :-)

Having said that, the first couple of million lines is all you need to 
reproduce the results I'm getting, which you can download with:

curl 
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | 
bunzip2 | head -n 2000000 > enwiki-short.xml




Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Paul Mooser
Thanks - when I said small, I was referring to the fact that your tests 
were using the first 10,000 pages, as opposed to the entire data dump. Sorry 
if I was unclear or misunderstood. 

On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote:

 The dataset I'm using is a Wikipedia dump, which hardly counts as small 
 :-)




Re: Parallelising over a lazy sequence - request for help

2013-09-29 Thread Brian Craft
On the other hand it is 2013, not 2003. 40G is small in terms of modern 
hardware. Terabyte-RAM servers have been available for a while, at prices 
within the reach of many projects. Large data in this decade is measured 
in petabytes, at least.


On Sunday, September 29, 2013 5:13:14 PM UTC-7, Paul Mooser wrote:

 Thanks - when I said small, I was referring to the fact that your tests 
 were using the first 10,000 pages, as opposed to the entire data dump. Sorry 
 if I was unclear or misunderstood. 

 On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote:

 The dataset I'm using is a Wikipedia dump, which hardly counts as small 
 :-)





Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote:

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
 
 I would be curious to know how this approach performs with your data.  With 
 the generated data I used, the partition+fold and partition+pmap approaches 
 both used most of my cores and had similar perf.

Hey Stuart, 

Thanks for getting back to me.

I've updated the code for my word count example on GitHub with (I believe) 
something that works along the lines you suggest:

https://github.com/paulbutcher/parallel-word-count

Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each of 
these runs counts the words in the first 10,000 pages of Wikipedia:

 $ lein run 10000 ~/enwiki.xml sequential
 Elapsed time: 23630.443 msecs
 $ lein run 10000 ~/enwiki.xml fold
 Elapsed time: 8697.79 msecs
 $ lein run 10000 ~/enwiki.xml pmap 10000
 Elapsed time: 27393.703 msecs
 $ lein run 10000 ~/enwiki.xml pthenf 10000
 Elapsed time: 37263.193 msecs

As you can see, the foldable-seq version gives an almost 3x speedup 
relative to the sequential version, and both the partition-then-pmap and 
partition-then-fold versions are significantly slower.

The last argument for the partition-then-pmap and partition-then-fold versions 
is a partition size. I've tried various different sizes with no material effect:

 $ lein run 10000 ~/enwiki.xml pthenf 1000
 Elapsed time: 43187.385 msecs
 $ lein run 10000 ~/enwiki.xml pthenf 100000
 Elapsed time: 35702.651 msecs
 $ lein run 10000 ~/enwiki.xml pmap 1000
 Elapsed time: 34087.314 msecs
 $ lein run 10000 ~/enwiki.xml pmap 100000
 Elapsed time: 47340.969 msecs

The performance of the partition-then-pmap version is actually much worse than 
the numbers above suggest. There's something very weird going on with (I guess) 
garbage collection - it takes a *long* time to finish after printing the 
elapsed time and the performance is completely pathological with larger page 
counts.

Bottom line: the only way I've been able to obtain any kind of speedup remains 
foldable-seq.

I'd be very grateful indeed if you could take a look at how I've implemented 
partition-then-fold to make sure that I've correctly captured your intent. Or 
if you have any suggestions for anything else that might work, or to explain 
the poor performance of partition-then-pmap and partition-then-fold.

My guess is that the problem with partition-then-fold is the copying that's 
going on during the (into [] %). I can see that it is operating in parallel 
because the number of cores in use goes up, but the net effect is an overall 
slowdown rather than a speedup.

That it performs worse than foldable-seq isn't surprising to me, given that it 
introduces an unnecessary copy.

I still think that it's a crying shame to disallow folding over sequences - as 
the above shows, the gains both in performance and programming ease are 
significant, and it would only take a small modification to the reducers API to 
fix the holding-onto-head problem. What would be the downside of making this 
modification and allowing foldable sequences?
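For readers following along, the partition-then-fold shape under discussion is roughly this (a sketch of my understanding, not Stu's exact code):

```clojure
(require '[clojure.core.reducers :as r])

;; Realize the lazy sequence in chunks, copy each chunk into a vector
;; (vectors support parallel fold), fold each chunk with fork/join,
;; then combine the per-chunk results. The (into [] chunk) copy is the
;; overhead in question.
(defn partition-then-fold [n combinef reducef coll]
  (->> (partition-all n coll)
       (map #(r/fold combinef reducef (into [] %)))
       (reduce combinef (combinef))))
```

E.g. (partition-then-fold 10000 (partial merge-with +) (fn [m w] (assoc m w (inc (get m w 0)))) words) for the word count.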


On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote:

 Hi Paul,
 
 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
 
 I would be curious to know how this approach performs with your data.  With 
 the generated data I used, the partition+fold and partition+pmap approaches 
 both used most of my cores and had similar perf.
 
 Enjoying your book!
 
 Stu
 
 
 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote:
 I'm currently working on a book on concurrent/parallel development for The 
 Pragmatic Programmers. One of the subjects I'm covering is parallel 
 programming in Clojure, but I've hit a roadblock with one of the examples. 
 I'm hoping that I can get some help to work through it here.
 
 The example counts the words contained within a Wikipedia dump. It should 
 respond well to parallelisation (I have Java and Scala versions that perform 
 excellently) but I've been incapable of coming up with a nice solution in 
 Clojure.
 
 The code I'm working with is checked into GitHub: 
 
 The basic sequential algorithm is:
 
 (frequencies (mapcat get-words pages))
 
 
 If I run that on the first 10k pages in the Wikipedia dump, it takes ~21s on my 
 MacBook Pro.
 
 One way to parallelise it is to create a parallel version of frequencies that 
 uses reducers:
 
 (defn frequencies-fold [words]
   (r/fold (partial merge-with +)
           (fn [counts word]
             (assoc counts word (inc (get counts word 0))))
           words))

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 01:22, Rich Morin r...@cfcl.com wrote:

 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote:
 I'm currently working on a book on concurrent/parallel development for The 
 Pragmatic Programmers. ...
 
 Ordered; PDF just arrived (:-).

Cool - very interested to hear your feedback once you've had a chance to read 
it.

 I don't know yet whether the book has anything like this, but I'd
 like to see a table that shows which concurrency and parallelism
 approaches are supported (and to what extent) by various languages.
 
 So, actors, agents, queues, reducers (etc) would be on one axis;
 Clojure, Erlang, Go, and Ruby (etc) would be on the other.  I'd
 expect Clojure to have pretty complete coverage; Ruby, not so much.

That might make a good appendix or section in the (as yet unwritten) wrap-up 
chapter - thanks for the suggestion.

And yes, I suspect that you're right that Clojure will fare well in the 
comparison :-)





Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
I do not know about the most important parts of your performance difficulties, 
but on a more trivial point I might be able to shed some light.

See the ClojureDocs page for pmap, which refers to the page for future, linked 
below.  If you call (shutdown-agents) the 60-second wait to exit should go away.

http://clojuredocs.org/clojure_core/clojure.core/future
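Concretely, that means something like this (a hypothetical -main, not Paul's actual code):

```clojure
(defn -main [& args]
  (try
    ;; pmap is built on futures, whose pool threads keep the JVM alive
    ;; for up to a minute after -main returns unless the pools are
    ;; shut down.
    (println (reduce + (pmap inc (range 1000))))
    (finally
      (shutdown-agents))))
```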

Andy

Sent from my iPhone

On Sep 28, 2013, at 1:41 AM, Paul Butcher p...@paulbutcher.com wrote:

 [...]
 The performance of the partition-then-pmap version is actually much worse 
 than the numbers above suggest. There's something very weird going on with (I 
 guess) garbage collection - it takes a *long* time to finish after printing 
 the elapsed time and the performance is completely pathological with larger 
 page counts.

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I would go a bit further and suggest that you do not use sequences at 
all and work only with reducible/foldable collections. Make an input reader 
which returns a foldable collection and you will have the most performant 
solution. The thing about holding onto the head is being worked on right 
now, see http://dev.clojure.org/jira/browse/CLJ-1250

JW
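A toy version of such a foldable input might look like this (a hypothetical sketch; a real implementation such as Iota is considerably more careful, and would hand chunks to fork/join rather than a sequential map):

```clojure
(require '[clojure.core.reducers :as r])

;; Wrap pre-chunked input in a collection that implements CollFold, so
;; r/fold can work on the chunks without ever building one big lazy seq.
(defn foldable-chunks [chunks]
  (reify
    clojure.core.protocols/CollReduce
    (coll-reduce [this f]
      (clojure.core.protocols/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (reduce (fn [acc chunk] (reduce f acc chunk)) init chunks))
    clojure.core.reducers/CollFold
    (coll-fold [_ n combinef reducef]
      (->> chunks
           (map #(r/fold n combinef reducef (vec %)))
           (reduce combinef (combinef))))))
```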

On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:

 On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.comjavascript: 
 wrote:

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
 .

 I would be curious to know how this approach performs with your data. 
  With the generated data I used, the partition+fold and partition+pmap 
 approaches both used most of my cores and had similar perf.


 Hey Stuart, 

 Thanks for getting back to me.

 I've updated the code for my word count example on GitHub with (I believe) 
 something that works along the lines you suggest:

 https://github.com/paulbutcher/parallel-word-count

 Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
 Each of these runs counts the words in the first 1 pages of Wikipedia:

 $ lein run 1 ~/enwiki.xml sequential
 Elapsed time: 23630.443 msecs
 $ lein run 1 ~/enwiki.xml fold
 Elapsed time: 8697.79 msecs
 $ lein run 1 ~/enwiki.xml pmap 1
 Elapsed time: 27393.703 msecs
 $ lein run 1 ~/enwiki.xml pthenf 1
 Elapsed time: 37263.193 msecs


 As you can see, the the foldable-seq version gives an almost 3x speedup 
 relative to the sequential version, and both the partition-then-pmap and 
 partition-then-fold versions are significantly slower.

 The last argument for the partition-then-pmap and partition-then-fold 
 versions is a partition size. I've tried various different sizes with no 
 material effect:

 $ lein run 1 ~/enwiki.xml pthenf 1000
 Elapsed time: 43187.385 msecs
 $ lein run 1 ~/enwiki.xml pthenf 10
 Elapsed time: 35702.651 msecs
 $ lein run 1 ~/enwiki.xml pmap 1000
 Elapsed time: 34087.314 msecs
 $ lein run 1 ~/enwiki.xml pmap 10
 Elapsed time: 47340.969 msecs


 The performance of the partition-then-pmap version is actually much worse 
 than the numbers above suggest. There's something very weird going on with 
 (I guess) garbage collection - it takes a *long* time to finish after 
 printing the elapsed time and the performance is completely pathological 
 with larger page counts.

 Bottom line: the only way I've been able to obtain any kind of speedup 
 remains foldable-seq.

 I'd be very grateful indeed if you could take a look at how I've 
 implemented partition-then-fold to make sure that I've correctly captured 
 your intent. Or if you have any suggestions for anything else that might 
 work, or to explain the poor performance of partition-then-pmap and 
 partition-then-fold.

 My guess is that the problem with partition-then-fold is the copying 
 that's going on during the (into [] %). I can see that it is operating in 
 parallel because the number of cores in use goes up, but the net effect is 
 an overall slowdown rather than a speedup.

 That it performs worse than foldable-seq isn't surprising to me, given 
 that it introduces an unnecessary copy.

 I still think that it's a crying shame to disallow folding over sequences 
 - as the above shows, the gains both in performance and programming ease 
 are significant, and it would only take a small modification to the 
 reducers API to fix the holding-onto-head problem. What would be the 
 downside of making this modification and allowing foldable sequences?

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: pa...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
  
 On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote:

 Hi Paul,

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
 .

 I would be curious to know how this approach performs with your data. 
  With the generated data I used, the partition+fold and partition+pmap 
 approaches both used most of my cores and had similar perf.

 Enjoying your book!

 Stu


 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher pa...@paulbutcher.com wrote:

 I'm currently working on a book on concurrent/parallel development for 
 The Pragmatic Programmers. One of the subjects I'm covering is parallel 
 programming in Clojure, but I've hit a roadblock with one of the examples. 
 I'm hoping that I can get some help to work through it here.

 The example counts the words contained within a Wikipedia dump. It should 
 respond well to parallelisation (I have Java 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Ah - one mystery down. Thanks Andy!

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 16:26, Andy Fingerhut andy.finger...@gmail.com wrote:

 I do not know about the most important parts of your performance 
 difficulties, but on a more trivial point I might be able to shed some light.
 
 See the ClojureDocs page for pmap, which refers to the page for future, 
 linked below.  If you call (shutdown-agents) the 60-second wait to exit 
 should go away.
 
 http://clojuredocs.org/clojure_core/clojure.core/future
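For readers following along, a minimal sketch of the fix (hypothetical example program, not the thread's codebase): pmap is built on futures, and the future/agent thread pool's non-daemon threads keep the JVM alive for up to a minute after main returns unless the pool is shut down explicitly.

```clojure
;; Hypothetical -main using pmap. Without (shutdown-agents), the
;; process lingers for roughly 60 seconds after printing.
(defn -main [& _args]
  (let [squares (doall (pmap #(* % %) (range 8)))]
    (println (reduce + squares)))
  ;; Shut down the agent/future thread pool so the JVM can exit promptly.
  (shutdown-agents))
```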
 
 Andy
 
 Sent from my iPhone
 
 On Sep 28, 2013, at 1:41 AM, Paul Butcher p...@paulbutcher.com wrote:
 
 On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote:
 
 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
 
 I would be curious to know how this approach performs with your data.  With 
 the generated data I used, the partition+fold and partition+pmap approaches 
 both used most of my cores and had similar perf.
 
 Hey Stuart, 
 
 Thanks for getting back to me.
 
 I've updated the code for my word count example on GitHub with (I believe) 
 something that works along the lines you suggest:
 
 https://github.com/paulbutcher/parallel-word-count
 
 Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each 
 of these runs counts the words in the first 1 pages of Wikipedia:
 
 $ lein run 1 ~/enwiki.xml sequential
 Elapsed time: 23630.443 msecs
 $ lein run 1 ~/enwiki.xml fold
 Elapsed time: 8697.79 msecs
 $ lein run 1 ~/enwiki.xml pmap 1
 Elapsed time: 27393.703 msecs
 $ lein run 1 ~/enwiki.xml pthenf 1
 Elapsed time: 37263.193 msecs
 
  As you can see, the foldable-seq version gives an almost 3x speedup 
 relative to the sequential version, and both the partition-then-pmap and 
 partition-then-fold versions are significantly slower.
 
 The last argument for the partition-then-pmap and partition-then-fold 
 versions is a partition size. I've tried various different sizes with no 
 material effect:
 
 $ lein run 1 ~/enwiki.xml pthenf 1000
 Elapsed time: 43187.385 msecs
 $ lein run 1 ~/enwiki.xml pthenf 10
 Elapsed time: 35702.651 msecs
 $ lein run 1 ~/enwiki.xml pmap 1000
 Elapsed time: 34087.314 msecs
 $ lein run 1 ~/enwiki.xml pmap 10
 Elapsed time: 47340.969 msecs
 
 The performance of the partition-then-pmap version is actually much worse 
 than the numbers above suggest. There's something very weird going on with 
 (I guess) garbage collection - it takes a *long* time to finish after 
 printing the elapsed time and the performance is completely pathological 
 with larger page counts.
 
 Bottom line: the only way I've been able to obtain any kind of speedup 
 remains foldable-seq.
 
 I'd be very grateful indeed if you could take a look at how I've implemented 
 partition-then-fold to make sure that I've correctly captured your intent. 
 Or if you have any suggestions for anything else that might work, or to 
 explain the poor performance of partition-then-pmap and partition-then-fold.
 
 My guess is that the problem with partition-then-fold is the copying that's 
 going on during the (into [] %). I can see that it is operating in parallel 
 because the number of cores in use goes up, but the net effect is an overall 
 slowdown rather than a speedup.
 
 That it performs worse than foldable-seq isn't surprising to me, given that 
 it introduces an unnecessary copy.
 
 I still think that it's a crying shame to disallow folding over sequences - 
 as the above shows, the gains both in performance and programming ease are 
 significant, and it would only take a small modification to the reducers API 
 to fix the holding-onto-head problem. What would be the downside of making 
 this modification and allowing foldable sequences?
 
 --
 paul.butcher-msgCount++
 
 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?
 
 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: p...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
 
 On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote:
 
 Hi Paul,
 
 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
 
 I would be curious to know how this approach performs with your data.  With 
 the generated data I used, the partition+fold and partition+pmap approaches 
 both used most of my cores and had similar perf.
 
 Enjoying your book!
 
 Stu
 
 
 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote:
 I'm currently working on a 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote:

 I would go a bit further and suggest that you do not use sequences at 
 all and work only with reducible/foldable collections. Make an input reader 
 which returns a foldable collection and you will have the most performant 
 solution. The thing about holding onto the head is being worked on right now, 
 see http://dev.clojure.org/jira/browse/CLJ-1250


That's fantastic news Jozef - is there any idea when that might be fixed? I can 
see that it's been triaged, but I'm not sure what exactly that means when it 
comes to the Clojure dev process?

Could you expand on exactly what you mean when you say an input reader which 
returns a foldable collection? Bear in mind that the problem I'm trying to 
solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one 
time.

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

-- 
-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
I mean that you should forget about lazy sequences and sequences in
general if you want cutting-edge performance with reducers. An example
of a reducible slurp, https://gist.github.com/wagjo/6743885 , does
not hold onto the head.
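A rough sketch of the idea (hypothetical code, not Jozef's actual gist): implement CollReduce directly over the file so that the reduce drives the I/O, and no lazy sequence, and hence no retained head, is ever created.

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

;; Hypothetical reducible line reader. The reduce drives the reader
;; directly; no lazy seq is built, so nothing can hold onto the head.
(defn reducible-lines [file]
  (reify p/CollReduce
    (coll-reduce [this f]
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [rdr (io/reader file)]
        (loop [acc init]
          (if (reduced? acc)          ; honour early termination
            @acc
            (if-let [line (.readLine ^java.io.BufferedReader rdr)]
              (recur (f acc line))
              acc)))))))

;; clojure.core/reduce dispatches through CollReduce, e.g.:
;; (reduce (fn [n _] (inc n)) 0 (reducible-lines "big.xml"))
```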

JW



On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher p...@paulbutcher.com wrote:

 On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote:

 I would go a bit further and suggest that you do not use sequences at
 all and work only with reducible/foldable collections. Make an input reader
 which returns a foldable collection and you will have the most performant
 solution. The thing about holding onto the head is being worked on right
 now, see http://dev.clojure.org/jira/browse/CLJ-1250


 That's fantastic news Jozef - is there any idea when that might be fixed?
 I can see that it's been triaged, but I'm not sure what exactly that means
 when it comes to the Clojure dev process?

 Could you expand on exactly what you mean when you say an input reader
 which returns a foldable collection? Bear in mind that the problem I'm
 trying to solve involves a 40GB Wikipedia dump, which clearly can't all be
 in RAM at one time.

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: p...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher





Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 17:42, Jozef Wagner jozef.wag...@gmail.com wrote:

 I mean that you should forget about lazy sequences and sequences in general, 
 if you want cutting-edge performance with reducers. An example of a 
 reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto 
 the head.

OK - I buy that logic. But I'm unsure whether it can be applied to the case 
that I'm interested in.

Firstly, I'm dealing with an XML file, so I need to parse it. Right now I'm 
using data.xml/parse to do so, which returns a lazy sequence. So, short of 
writing my own XML parser, I have to go via a lazy sequence at some point along 
the way?

Secondly, the primary point of this is to exploit parallelism, not reducers 
per-se. So supporting CollReduce isn't enough, I also need to support CollFold. 
Which is exactly what foldable-seq does - I'm not sure how that differs from 
what you're proposing?
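For context, a sketch of the distinction being discussed (hypothetical word count, not the repo's actual code): fold needs a combining function to merge parallel partial results, and on a collection that only supports CollReduce it silently degrades to a sequential reduce.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical word count via r/fold. The no-arg arity of the
;; combinef supplies the identity value for each parallel subtask.
(defn merge-counts
  ([] {})
  ([a b] (merge-with + a b)))

;; words must be a foldable collection (e.g. a vector) to actually
;; run in parallel; a plain seq falls back to sequential reduce.
(defn fold-word-count [words]
  (r/fold merge-counts
          (fn [counts w] (assoc counts w (inc (get counts w 0))))
          words))
```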

Am I missing something?

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Well, it should be possible to implement a foldseq variant which takes a 
reducible collection as an input. This would speed things up, as you don't 
create so much garbage with reducers. An XML parser which produces a reducible 
collection will be a bit harder :). 

Anyway, I think the bottleneck in your code is at 
https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9
 
Instead of creating a new persistent map for each word, you should use a 
transient here.
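A sketch of that suggestion (hypothetical code, not the repo's line 9): accumulate the counts in a transient map, mutating in place with assoc!, and materialise a single persistent map at the end.

```clojure
;; Hypothetical reducing step using a transient map: only one
;; persistent map is built, at the very end, instead of one per word.
(defn count-words! [words]
  (persistent!
   (reduce (fn [counts w]
             (assoc! counts w (inc (get counts w 0))))
           (transient {})
           words)))

;; (count-words! ["to" "be" "or" "not" "to" "be"])
;; ;=> {"to" 2, "be" 2, "or" 1, "not" 1}
```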

JW

On Saturday, September 28, 2013 7:12:08 PM UTC+2, Paul Butcher wrote:

 On 28 Sep 2013, at 17:42, Jozef Wagner jozef@gmail.com wrote:

 I mean that you should forget about lazy sequences and sequences in 
 general, if you want cutting-edge performance with reducers. An example of a 
 reducible slurp, https://gist.github.com/wagjo/6743885 , does 
 not hold onto the head.


 OK - I buy that logic. But I'm unsure whether it can be applied to the 
 case that I'm interested in.

 Firstly, I'm dealing with an XML file, so I need to parse it. Right now 
 I'm using data.xml/parse to do so, which returns a lazy sequence. So, short 
 of writing my own XML parser, I have to go via a lazy sequence at some 
 point along the way?

 Secondly, the primary point of this is to exploit parallelism, not 
 reducers per-se. So supporting CollReduce isn't enough, I also need to 
 support CollFold. Which is exactly what foldable-seq does - I'm not sure 
 how that differs from what you're proposing?

 Am I missing something?

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: pa...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
  




Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Or even better, use guava's Multiset there...

On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote:

 Well, it should be possible to implement a foldseq variant which takes a 
 reducible collection as an input. This would speed things up, as you don't 
 create so much garbage with reducers. An XML parser which produces a reducible 
 collection will be a bit harder :). 

 Anyway, I think the bottleneck in your code is at 
 https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9
 Instead of creating a new persistent map for each word, you should use a 
 transient here.

 JW

 On Saturday, September 28, 2013 7:12:08 PM UTC+2, Paul Butcher wrote:

 On 28 Sep 2013, at 17:42, Jozef Wagner jozef@gmail.com wrote:

 I mean that you should forget about lazy sequences and sequences in 
 general, if you want cutting-edge performance with reducers. An example of a 
 reducible slurp, https://gist.github.com/wagjo/6743885 , does 
 not hold onto the head.


 OK - I buy that logic. But I'm unsure whether it can be applied to the 
 case that I'm interested in.

 Firstly, I'm dealing with an XML file, so I need to parse it. Right now 
 I'm using data.xml/parse to do so, which returns a lazy sequence. So, short 
 of writing my own XML parser, I have to go via a lazy sequence at some 
 point along the way?

 Secondly, the primary point of this is to exploit parallelism, not 
 reducers per-se. So supporting CollReduce isn't enough, I also need to 
 support CollFold. Which is exactly what foldable-seq does - I'm not sure 
 how that differs from what you're proposing?

 Am I missing something?

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: pa...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
  




Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Andy Fingerhut
If a Clojure ticket is "triaged", it means that one of the Clojure screeners
believes the ticket's description describes a real issue with Clojure that
ought to be changed in some way, and would like Rich Hickey to look at it
and see whether he agrees.  If he does, it becomes "vetted".  A diagram of
the process that Clojure tickets go through is shown at the link below (you
will have to scroll down a bit to see it):

http://dev.clojure.org/display/community/JIRA+workflow

If you are curious to know about changes in any particular ticket's state
as they occur, create an account on JIRA [1] and click the "Watch" link
when viewing that ticket.  You will be emailed about changes in its state.

The soonest that any ticket might become part of a modified version of
Clojure released as a JAR file is when Clojure 1.6.0 is released.  If
forced to guess, I would put a small bet on that happening in the first half of
2014 some time, but I have no actual knowledge from those deciding when
that will happen.  In the past it seems to have been based more upon when
certain features were added than on a particular amount of time passing.
Until the release, there is no promise that any particular ticket will or
will not become part of the release.

[1] http://dev.clojure.org/jira/secure/Signup!default.jspa

Andy



On Sat, Sep 28, 2013 at 9:24 AM, Paul Butcher p...@paulbutcher.com wrote:

 On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote:

 I would go a bit further and suggest that you do not use sequences at
 all and work only with reducible/foldable collections. Make an input reader
 which returns a foldable collection and you will have the most performant
 solution. The thing about holding onto the head is being worked on right
 now, see http://dev.clojure.org/jira/browse/CLJ-1250


 That's fantastic news Jozef - is there any idea when that might be fixed?
 I can see that it's been triaged, but I'm not sure what exactly that means
 when it comes to the Clojure dev process?

 Could you expand on exactly what you mean when you say an input reader
 which returns a foldable collection? Bear in mind that the problem I'm
 trying to solve involves a 40GB Wikipedia dump, which clearly can't all be
 in RAM at one time.

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: p...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher





Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 19:51, Jozef Wagner jozef.wag...@gmail.com wrote:

 Anyway, I think the bottleneck in your code is at 
 https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9
  Instead of creating new persistent map for each word, you should use a 
 transient here.

I would love it if that were true. However, the code you reference is from the 
algorithm that uses foldable-seq and runs quickly (i.e. not the one that has 
the performance problem).

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
For your timings, I would also strongly recommend altering your project.clj 
to force the -server hotspot:

  :jvm-opts ^:replace ["-Xmx1g" "-server" ... and whatever else you want 
here ... ]

By default lein will use tiered compilation to optimize repl startup, which 
is not what you want for timing.

And I second Andy's (shutdown-agents) advice. 


On Saturday, September 28, 2013 3:41:20 AM UTC-5, Paul Butcher wrote:

 On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote:

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
 .

 I would be curious to know how this approach performs with your data. 
  With the generated data I used, the partition+fold and partition+pmap 
 approaches both used most of my cores and had similar perf.


 Hey Stuart, 

 Thanks for getting back to me.

 I've updated the code for my word count example on GitHub with (I believe) 
 something that works along the lines you suggest:

 https://github.com/paulbutcher/parallel-word-count

 Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
 Each of these runs counts the words in the first 1 pages of Wikipedia:

 $ lein run 1 ~/enwiki.xml sequential
 Elapsed time: 23630.443 msecs
 $ lein run 1 ~/enwiki.xml fold
 Elapsed time: 8697.79 msecs
 $ lein run 1 ~/enwiki.xml pmap 1
 Elapsed time: 27393.703 msecs
 $ lein run 1 ~/enwiki.xml pthenf 1
 Elapsed time: 37263.193 msecs


 As you can see, the foldable-seq version gives an almost 3x speedup 
 relative to the sequential version, and both the partition-then-pmap and 
 partition-then-fold versions are significantly slower.

 The last argument for the partition-then-pmap and partition-then-fold 
 versions is a partition size. I've tried various different sizes with no 
 material effect:

 $ lein run 1 ~/enwiki.xml pthenf 1000
 Elapsed time: 43187.385 msecs
 $ lein run 1 ~/enwiki.xml pthenf 10
 Elapsed time: 35702.651 msecs
 $ lein run 1 ~/enwiki.xml pmap 1000
 Elapsed time: 34087.314 msecs
 $ lein run 1 ~/enwiki.xml pmap 10
 Elapsed time: 47340.969 msecs


 The performance of the partition-then-pmap version is actually much worse 
 than the numbers above suggest. There's something very weird going on with 
 (I guess) garbage collection - it takes a *long* time to finish after 
 printing the elapsed time and the performance is completely pathological 
 with larger page counts.

 Bottom line: the only way I've been able to obtain any kind of speedup 
 remains foldable-seq.

 I'd be very grateful indeed if you could take a look at how I've 
 implemented partition-then-fold to make sure that I've correctly captured 
 your intent. Or if you have any suggestions for anything else that might 
 work, or to explain the poor performance of partition-then-pmap and 
 partition-then-fold.

 My guess is that the problem with partition-then-fold is the copying 
 that's going on during the (into [] %). I can see that it is operating in 
 parallel because the number of cores in use goes up, but the net effect is 
 an overall slowdown rather than a speedup.

 That it performs worse than foldable-seq isn't surprising to me, given 
 that it introduces an unnecessary copy.

 I still think that it's a crying shame to disallow folding over sequences 
 - as the above shows, the gains both in performance and programming ease 
 are significant, and it would only take a small modification to the 
 reducers API to fix the holding-onto-head problem. What would be the 
 downside of making this modification and allowing foldable sequences?

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: pa...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
  
 On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote:

 Hi Paul,

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
 .

 I would be curious to know how this approach performs with your data. 
  With the generated data I used, the partition+fold and partition+pmap 
 approaches both used most of my cores and had similar perf.

 Enjoying your book!

 Stu


 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher pa...@paulbutcher.com wrote:

 I'm currently working on a book on concurrent/parallel development for 
 The Pragmatic Programmers. One of the subjects I'm covering is parallel 
 programming in Clojure, but I've hit a roadblock with one of the examples. 
 I'm hoping that I can get some help to work through it here.

 The example counts the words contained within a Wikipedia dump. It should 
 respond well to parallelisation (I have Java and Scala 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
Reducers (and fork/join in general) are best suited for fine-grained 
computational parallelism on in-memory data. The problem in question 
involves processing more data than will fit in memory.

So the question is then what is the best way to parallelize computation 
over the stream. There are many ways to do so - pmap is one easy mechanism 
to get parallelism over a lazy sequence but the chunking behavior of pmap 
can easily be non-optimal for particular use cases. Stu's other solution 
(prepare-with-partition-then-fold) reads chunks of data into memory, then 
uses f/j over the top of those chunks. 

Yet another possible solution is to use a pool of compute threads and to 
read from the stream, give those compute tasks to the pool to execute in a 
background thread, then reassemble the results at the end (or in a separate 
thread). The balance in this approach is to make the tasks the right 
chunkiness. Using a buffered queue is one technique to avoid having your 
input reader overload the capacity of the processing system.

I would also mention that using transients as you build input collections 
will be a big win. 
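The pool-plus-buffered-queue approach described above might be sketched like this (all names and sizes are hypothetical, not code from the thread); the bounded queue provides the backpressure: .put blocks the reader when the workers fall behind.

```clojure
(import '(java.util.concurrent ArrayBlockingQueue Executors TimeUnit))

;; Hypothetical word count over batches of words using a fixed thread
;; pool fed through a bounded queue.
(defn queued-word-count [batches nthreads]
  (let [queue   (ArrayBlockingQueue. 16)   ; capacity bounds reader lead
        pool    (Executors/newFixedThreadPool nthreads)
        results (atom [])
        worker  (fn []
                  (loop []
                    (let [batch (.take queue)]
                      (when-not (= ::done batch)
                        (swap! results conj (frequencies batch))
                        (recur)))))]
    (dotimes [_ nthreads] (.execute pool worker))
    (doseq [b batches] (.put queue b))          ; blocks when queue is full
    (dotimes [_ nthreads] (.put queue ::done))  ; one poison pill per worker
    (.shutdown pool)
    (.awaitTermination pool 1 TimeUnit/MINUTES)
    ;; reassemble the partial counts at the end
    (apply merge-with + @results)))
```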

Alex

On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:

 I would go a bit further and suggest that you do not use sequences at 
 all and work only with reducible/foldable collections. Make an input reader 
 which returns a foldable collection and you will have the most performant 
 solution. The thing about holding onto the head is being worked on right 
 now, see http://dev.clojure.org/jira/browse/CLJ-1250

 JW

 On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:

 On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote:

 I have posted an example that shows partition-then-fold at 
 https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
 .

 I would be curious to know how this approach performs with your data. 
  With the generated data I used, the partition+fold and partition+pmap 
 approaches both used most of my cores and had similar perf.


 Hey Stuart, 

 Thanks for getting back to me.

 I've updated the code for my word count example on GitHub with (I 
 believe) something that works along the lines you suggest:

 https://github.com/paulbutcher/parallel-word-count

 Here are some sample runs on my machine (a 4-core retina MacBook Pro). 
 Each of these runs counts the words in the first 10k pages of Wikipedia:

 $ lein run 1 ~/enwiki.xml sequential
 Elapsed time: 23630.443 msecs
 $ lein run 1 ~/enwiki.xml fold
 Elapsed time: 8697.79 msecs
 $ lein run 1 ~/enwiki.xml pmap 1
 Elapsed time: 27393.703 msecs
 $ lein run 1 ~/enwiki.xml pthenf 1
 Elapsed time: 37263.193 msecs


 As you can see, the foldable-seq version gives an almost 3x speedup 
 relative to the sequential version, and both the partition-then-pmap and 
 partition-then-fold versions are significantly slower.

 The last argument for the partition-then-pmap and partition-then-fold 
 versions is a partition size. I've tried various different sizes with no 
 material effect:

 $ lein run 1 ~/enwiki.xml pthenf 1000
 Elapsed time: 43187.385 msecs
 $ lein run 1 ~/enwiki.xml pthenf 10
 Elapsed time: 35702.651 msecs
 $ lein run 1 ~/enwiki.xml pmap 1000
 Elapsed time: 34087.314 msecs
 $ lein run 1 ~/enwiki.xml pmap 10
 Elapsed time: 47340.969 msecs


 The performance of the partition-then-pmap version is actually much worse 
 than the numbers above suggest. There's something very weird going on with 
 (I guess) garbage collection - it takes a *long* time to finish after 
 printing the elapsed time and the performance is completely pathological 
 with larger page counts.

 Bottom line: the only way I've been able to obtain any kind of speedup 
 remains foldable-seq.

 I'd be very grateful indeed if you could take a look at how I've 
 implemented partition-then-fold to make sure that I've correctly captured 
 your intent. Or if you have any suggestions for anything else that might 
 work, or to explain the poor performance of partition-then-pmap and 
 partition-then-fold.

 My guess is that the problem with partition-then-fold is the copying 
 that's going on during the (into [] %). I can see that it is operating in 
 parallel because the number of cores in use goes up, but the net effect is 
 an overall slowdown rather than a speedup.

 That it performs worse than foldable-seq isn't surprising to me, given 
 that it introduces an unnecessary copy.

 I still think that it's a crying shame to disallow folding over sequences 
 - as the above shows, the gains both in performance and programming ease 
 are significant, and it would only take a small modification to the 
 reducers API to fix the holding-onto-head problem. What would be the 
 downside of making this modification and allowing foldable sequences?

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell 

Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Alex Miller
I am hoping that this will be fixed for 1.6 but no one is actually 
working on it afaik. If someone wants to take it on, I would GREATLY 
appreciate a patch on this ticket (must be a contributor of course).

On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote:

 On 28 Sep 2013, at 17:14, Jozef Wagner jozef@gmail.com wrote:

 I would go a bit further and suggest that you do not use sequences at 
 all and work only with reducible/foldable collections. Make an input reader 
 which returns a foldable collection and you will have the most performant 
 solution. The thing about holding onto the head is being worked on right 
 now, see http://dev.clojure.org/jira/browse/CLJ-1250


 That's fantastic news Jozef - is there any idea when that might be fixed? 
 I can see that it's been triaged, but I'm not sure what exactly that means 
 when it comes to the Clojure dev process?

 Could you expand on exactly what you mean when you say an input reader 
 which returns a foldable collection? Bear in mind that the problem I'm 
 trying to solve involves a 40GB Wikipedia dump, which clearly can't all be 
 in RAM at one time.
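One plausible reading of Jozef's suggestion (my interpretation, not his code): wrap the file itself in clojure.core.protocols/CollReduce, so reducing streams lines straight from disk and never builds a lazy seq whose head anyone could hold.

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

(defn reducible-lines
  "A reducible view of a file's lines. Each reduce re-opens the file
  and streams it; no seq is built. (reduced short-circuiting omitted
  for brevity.)"
  [file]
  (reify p/CollReduce
    (coll-reduce [this f] (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [rdr (io/reader file)]
        (loop [acc init]
          (if-let [line (.readLine ^java.io.BufferedReader rdr)]
            (recur (f acc line))
            acc))))))
```

For example, `(reduce (fn [n _] (inc n)) 0 (reducible-lines "enwiki.xml"))` counts lines without ever holding more than one in memory.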

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: pa...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher
  


-- 
-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Jozef Wagner
Can't your last possible solution rather be implemented on top of f/j pool?
Is it possible to beat f/j pool performance with ad-hoc thread-pool in
situations where there are thousands of tasks?

JW


On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller a...@puredanger.com wrote:

 Reducers (and fork/join in general) are best suited for fine-grained
 computational parallelism on in-memory data. The problem in question
 involves processing more data than will fit in memory.

 So the question is then what is the best way to parallelize computation
 over the stream. There are many ways to do so - pmap is one easy mechanism
 to get parallelism over a lazy sequence but the chunking behavior of pmap
 can easily be non-optimal for particular use cases. Stu's other solution
 (prepare-with-partition-then-fold) reads chunks of data into memory, then
 uses f/j over the top of those chunks.

 Yet another possible solution is to use a pool of compute threads and to
 read from the stream, give those compute tasks to the pool to execute in a
 background thread, then reassemble the results at the end (or in a separate
 thread). The balance in this approach is to make the tasks the right
 chunkiness. Using a buffered queue is one technique to avoid having your
 input reader overload the capacity of the processing system.

 I would also mention that using transients as you build input collections
 will be a big win.

 Alex


Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid 
of the pause at the end of the pmap solution, and the -server argument made a 
very slight across-the-board performance improvement. But neither of them 
fundamentally change the basic result (that the implementation that uses 
foldable-seq gives a 3x speedup and the other implementations are slower than 
the sequential version).
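For reference, the two tweaks combined in a minimal harness (an assumed shape, not Paul's actual code; the word source here is a stand-in):

```clojure
;; project.clj:  :jvm-opts ^:replace ["-Xmx1g" "-server"]
;; Calling shutdown-agents at the end of -main stops the agent/pmap
;; thread pool, so the JVM exits promptly instead of idling for ~60s.
(defn count-words [words]
  (reduce (partial merge-with +)
          (pmap frequencies (partition-all 100 words))))

(defn -main [& args]
  (time (dorun (count-words (map str (range 1000)))))
  (shutdown-agents))
```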

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 21:47, Alex Miller a...@puredanger.com wrote:

 For your timings, I would also strongly recommend altering your project.clj 
 to force the -server hotspot:
 
    :jvm-opts ^:replace ["-Xmx1g" "-server" ... and whatever else you want here ...]
 
 By default lein will use tiered compilation to optimize repl startup, which 
 is not what you want for timing.
 
 And I second Andy's (shutdown-agents) advice. 



Re: Parallelising over a lazy sequence - request for help

2013-09-28 Thread Paul Butcher
On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote:

 Reducers (and fork/join in general) are best suited for fine-grained 
 computational parallelism on in-memory data. The problem in question involves 
 processing more data than will fit in memory.
 
 So the question is then what is the best way to parallelize computation over 
 the stream. There are many ways to do so - pmap is one easy mechanism to get 
 parallelism over a lazy sequence but the chunking behavior of pmap can easily 
 be non-optimal for particular use cases. Stu's other solution 
 (prepare-with-partition-then-fold) reads chunks of data into memory, then 
 uses f/j over the top of those chunks. 

Yes - I'm aware that that's how Stu's solution works. However when I use that 
in my example, the performance sucks. As I said in my earlier message, I 
believe that the reason for this is the additional copying that it involves.

To date, I have exactly one implementation that has good parallel performance - 
the one that uses foldable-seq (which is 3x faster than the sequential 
version). Everything else I've tried performs worse than the sequential version.

 Yet another possible solution is to use a pool of compute threads and to read 
 from the stream, give those compute tasks to the pool to execute in a 
 background thread, then reassemble the results at the end (or in a separate 
 thread). The balance in this approach is to make the tasks the right 
 chunkiness. Using a buffered queue is one technique to avoid having your 
 input reader overload the capacity of the processing system.

That's basically *exactly* what the foldable-seq implementation does.

 I would also mention that using transients as you build input collections 
 will be a big win. 

I'm sure that that's true - but the foldable-seq implementation is the one that 
would gain most from that, and that's the one that already runs 3x faster than 
the sequential implementation. But it's also the one that Stu objects to. What 
I'm trying to find is an implementation that Stu doesn't object to, but is 
faster than the sequential version of the algorithm.

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher


Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Stuart Halloway
Hi Paul,

I have posted an example that shows partition-then-fold at
https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
.

I would be curious to know how this approach performs with your data.  With
the generated data I used, the partition+fold and partition+pmap approaches
both used most of my cores and had similar perf.

Enjoying your book!

Stu


On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote:

 I'm currently working on a book on concurrent/parallel development for The
 Pragmatic Programmers. One of the subjects I'm covering is parallel
 programming in Clojure, but I've hit a roadblock with one of the examples.
 I'm hoping that I can get some help to work through it here.

 The example counts the words contained within a Wikipedia dump. It should
 respond well to parallelisation (I have Java and Scala versions that
 perform excellently) but I've been incapable of coming up with a nice
 solution in Clojure.

 The code I'm working with is checked into GitHub: 
 https://github.com/paulbutcher/parallel-word-count

 The basic sequential algorithm is:

 (frequencies (mapcat get-words pages))


 If I run that on the first 10k pages in the Wikipedia dump, it takes ~21s on
 my MacBook Pro.

 One way to parallelise it is to create a parallel version of frequencies
 that uses reducers:

 (defn frequencies-fold [words]
   (r/fold (partial merge-with +)
           (fn [counts word] (assoc counts word (inc (get counts word 0))))
           words))
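A quick self-contained check of the function above (repeating the definition so the snippet runs on its own; a small input stays below fold's parallel threshold, but the result is the same):

```clojure
(require '[clojure.core.reducers :as r])

(defn frequencies-fold [words]
  (r/fold (partial merge-with +)
          (fn [counts word] (assoc counts word (inc (get counts word 0))))
          words))

(frequencies-fold ["the" "cat" "the" "hat"])
;; => {"the" 2, "cat" 1, "hat" 1}
```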


 And sure enough, if I use that, along with the foldable-seq utility
 (https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj)
 I posted about here
 (https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J) a
 while ago, it runs in ~8s, almost a 3x speedup, not bad given that the
 parallel version is unable to use transients.

 Unfortunately, as they currently stand, reducers have a fatal flaw that
 means that, even with foldable-seq, they're basically useless with lazy
 sequences. Reducers always hold onto the 
 headhttps://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ of
 the sequence they're given, so there's no way to use this approach for a
 complete Wikipedia dump (which runs to around 40GiB).

 So the other approach I've tried is to use pmap:

 (defn frequencies-pmap [words]
   (reduce (partial merge-with +)
           (pmap frequencies
                 (partition-all 1 words))))


 But, for reasons I don't understand, this performs dreadfully - taking
 ~26s, i.e. significantly slower than the sequential version.

 I've tried playing with different partition sizes without materially
 affecting the result.

 So, what I'm looking for is either:

 a) An explanation for why the pmap-based solution performs so badly

 b) A way to fix the holding onto head problem that's inherent within
 reducers.

 With the last of these in mind, it strikes me that the problem
 fundamentally arises from the desire for reducers to follow the same basic
 API as normal code. So:

 (reduce (filter ... (map ... coll)))

 becomes:

 (r/fold (r/filter ... (r/map ... coll)))

 A very small change to the reducers API - passing the collection to the
 reduce and/or fold - would avoid the problem:

 (r/fold (r/filter ... (r/map ...)) coll)

 Anyway - I'd be very grateful for any insight into either of the above
 questions. Or for suggestions for an alternative approach that might be
 more fruitful.

 Many thanks in advance,

 --
 paul.butcher-msgCount++

 Snetterton, Castle Combe, Cadwell Park...
 Who says I have a one track mind?

 http://www.paulbutcher.com/
 LinkedIn: http://www.linkedin.com/in/paulbutcher
 MSN: p...@paulbutcher.com
 AIM: paulrabutcher
 Skype: paulrabutcher






Re: Parallelising over a lazy sequence - request for help

2013-09-27 Thread Rich Morin
 On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote:
 I'm currently working on a book on concurrent/parallel development for The 
 Pragmatic Programmers. ...

Ordered; PDF just arrived (:-).


I don't know yet whether the book has anything like this, but I'd
like to see a table that shows which concurrency and parallelism
approaches are supported (and to what extent) by various languages.

So, actors, agents, queues, reducers (etc) would be on one axis;
Clojure, Erlang, Go, and Ruby (etc) would be on the other.  I'd
expect Clojure to have pretty complete coverage; Ruby, not so much.

-r

 -- 
http://www.cfcl.com/rdm        Rich Morin
http://www.cfcl.com/rdm/resume r...@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841

Software system design, development, and documentation




Parallelising over a lazy sequence - request for help

2013-05-25 Thread Paul Butcher
I'm currently working on a book on concurrent/parallel development for The 
Pragmatic Programmers. One of the subjects I'm covering is parallel programming 
in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that 
I can get some help to work through it here.

The example counts the words contained within a Wikipedia dump. It should 
respond well to parallelisation (I have Java and Scala versions that perform 
excellently) but I've been incapable of coming up with a nice solution in 
Clojure.

The code I'm working with is checked into GitHub: 
https://github.com/paulbutcher/parallel-word-count

The basic sequential algorithm is:

 (frequencies (mapcat get-words pages))


If I run that on the first 10k pages in the Wikipedia dump, it takes ~21s on my 
MacBook Pro.

One way to parallelise it is to create a parallel version of frequencies that 
uses reducers:

 (defn frequencies-fold [words]
   (r/fold (partial merge-with +)
           (fn [counts word] (assoc counts word (inc (get counts word 0))))
           words))

And sure enough, if I use that, along with the foldable-seq utility I 
posted about here a while ago, it runs in ~8s, almost a 3x speedup, not bad 
given that the parallel version is unable to use transients.

Unfortunately, as they currently stand, reducers have a fatal flaw that means 
that, even with foldable-seq, they're basically useless with lazy sequences. 
Reducers always hold onto the head of the sequence they're given, so there's no 
way to use this approach for a complete Wikipedia dump (which runs to around 
40GiB).

So the other approach I've tried is to use pmap:

 (defn frequencies-pmap [words]
   (reduce (partial merge-with +)
           (pmap frequencies
                 (partition-all 1 words))))

But, for reasons I don't understand, this performs dreadfully - taking ~26s, 
i.e. significantly slower than the sequential version.

I've tried playing with different partition sizes without materially affecting 
the result.

So, what I'm looking for is either:

a) An explanation for why the pmap-based solution performs so badly

b) A way to fix the holding onto head problem that's inherent within reducers.

With the last of these in mind, it strikes me that the problem fundamentally 
arises from the desire for reducers to follow the same basic API as normal 
code. So:

(reduce (filter ... (map ... coll)))

becomes:

(r/fold (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce 
and/or fold - would avoid the problem:

(r/fold (r/filter ... (r/map ...)) coll)
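As a sketch of the shape being proposed (hypothetical names; nothing here exists in clojure.core.reducers, and plain lazy-seq functions stand in for the reducer transformations):

```clojure
;; Transformations become collection-free values, composed with comp,
;; and only meet the collection at fold/reduce time - so nothing needs
;; to hold the head of a lazy sequence while the pipeline is built.
(defn xmap    [f]    (fn [coll] (map f coll)))
(defn xfilter [pred] (fn [coll] (filter pred coll)))

(defn fold-with
  "Stand-in for the proposed (r/fold xf coll) shape: apply the composed
  transformation, then reduce."
  [reducef xf coll]
  (reduce reducef (xf coll)))

;; (fold-with + (comp (xfilter odd?) (xmap inc)) (range 10)) ; => 25
```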

Anyway - I'd be very grateful for any insight into either of the above 
questions. Or for suggestions for an alternative approach that might be more 
fruitful.

Many thanks in advance,

--
paul.butcher-msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
