Re: Parallelising over a lazy sequence - request for help
Alan, apologies for the delayed reply - I remember Iota well (there was some cross-fertilisation between it and foldable-seq a few months back, IIRC :-) Having said that, I don't think that Iota will help in my particular situation (although I'd be delighted to be proven wrong). Given that the file I'm processing is an XML file, and will therefore have to pass through an XML parser, unless I write an XML parser on top of the reducers framework I'm stuck with dealing with sequences at some point along the way.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

On 30 Sep 2013, at 11:15, Alan Busby thebu...@gmail.com wrote:

Sorry to jump in, but I thought it worthwhile to add a couple of points (sorry for being brief):

1. Reducers work fine with data much larger than memory; you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential, not random, so spinning disks work great here.

2. A 40GB XML file is very often many, many smaller XML documents aggregated together. It's often faster to separate each document onto its own line (via various UNIX tools) and parse each line separately. I typically do something like: $ zcat bigxml.gz | tr '\n' ' ' | sed 's/foo/\nfoo/' | grep '^foo' > records.xml

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often use it for reducing over hundreds of GBs' worth of text data. It does what Jozef suggests, and makes a text file a foldable collection.

4. While pmap is great for advertising the power of Clojure, it's likely safe to say that it should be ignored if you're actually looking for performance.

Hope this helps, Alan Busby

-- -- You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
Sorry to jump in, but I thought it worthwhile to add a couple of points (sorry for being brief):

1. Reducers work fine with data much larger than memory; you just need to mmap() the data you're working with so Clojure thinks everything is in memory when it isn't. Reducer access is fairly sequential, not random, so spinning disks work great here.

2. A 40GB XML file is very often many, many smaller XML documents aggregated together. It's often faster to separate each document onto its own line (via various UNIX tools) and parse each line separately. I typically do something like: $ zcat bigxml.gz | tr '\n' ' ' | sed 's/foo/\nfoo/' | grep '^foo' > records.xml

3. Check out the Iota library, https://github.com/thebusby/iota/ . I often use it for reducing over hundreds of GBs' worth of text data. It does what Jozef suggests, and makes a text file a foldable collection.

4. While pmap is great for advertising the power of Clojure, it's likely safe to say that it should be ignored if you're actually looking for performance.

Hope this helps, Alan Busby
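Alan's Iota suggestion (point 3) amounts to folding directly over a file-backed collection. Below is a minimal word-count fold using only clojure.core.reducers over an in-memory vector of lines; with Iota, something like (iota/vec filename) would, if I recall its API correctly, stand in for the vector when the file is larger than memory. The count-words name and sample lines are illustrative, not from the thread:

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Minimal word-count fold over a vector of lines. r/fold splits the
;; vector into chunks for fork/join, reduces each chunk into a frequency
;; map, and merges the per-chunk maps with (merge-with +).
(defn count-words [lines]
  (r/fold
    (partial merge-with +)                 ; combine per-chunk maps
    (fn [counts line]
      (reduce (fn [m w] (update m w (fnil inc 0)))
              counts
              (str/split line #"\s+")))
    lines))

(count-words ["to be or" "not to be"])
;; => a map equal to {"to" 2, "be" 2, "or" 1, "not" 1}
```

Because the combining function is associative, the same code runs sequentially on small inputs and in parallel on large foldable ones.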
Re: Parallelising over a lazy sequence - request for help
To be clear, I don't object to the approach, only to naming it fold and/or tying it to interfaces related to folding. Stu On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher p...@paulbutcher.com wrote: On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote: Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks. Yes - I'm aware that that's how Stu's solution works. However when I use that in my example, the performance sucks. As I said in my earlier message, I believe that the reason for this is the additional copying that it involves. To date, I have exactly one implementation that has good parallel performance - the one that uses foldable-seq (which is 3x faster than the sequential version). Everything else I've tried performs worse than the sequential version. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. That's basically *exactly* what the foldable-seq implementation does. I would also mention that using transients as you build input collections will be a big win. 
I'm sure that that's true - but the foldable-seq implementation is the one that would gain most from that, and that's the one that already runs 3x faster than the sequential implementation. But it's also the one that Stu objects to. What I'm trying to find is an implementation that Stu doesn't object to, but is faster than the sequential version of the algorithm.
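The pool-of-compute-threads approach Alex describes can be illustrated with Java executors from Clojure. This is a minimal sketch under assumed names (process-in-pool, reduce-batch); the bounded queue Alex mentions for backpressure between the reader and the pool is omitted for brevity:

```clojure
(import '(java.util.concurrent Executors))

;; Minimal sketch of the pool-of-compute-threads idea: batches are handed
;; to a fixed thread pool, each task reduces one batch, and the partial
;; results are merged at the end. (A bounded queue between the reader and
;; the pool would add backpressure; it is omitted here.)
(defn process-in-pool [batches reduce-batch merge-results nthreads]
  (let [pool    (Executors/newFixedThreadPool nthreads)
        tasks   (map (fn [batch] (fn [] (reduce-batch batch))) batches)
        results (mapv #(.get ^java.util.concurrent.Future %)
                      (.invokeAll pool tasks))]
    (.shutdown pool)
    (reduce merge-results results)))

;; Toy usage - summing batches in parallel on 2 threads:
(process-in-pool [[1 2] [3 4] [5]] #(reduce + %) + 2)
;; => 15
```

Getting the batch size right is exactly the "chunkiness" trade-off discussed above: too small and coordination dominates, too large and the pool starves.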
Re: Parallelising over a lazy sequence - request for help
Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data? On Saturday, May 25, 2013 9:34:15 AM UTC-7, Paul Butcher wrote: The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure.
Re: Parallelising over a lazy sequence - request for help
On 29 Sep 2013, at 22:58, Paul Mooser taron...@gmail.com wrote: Paul, is there any easy way to get the (small) dataset you're working with, so we can run your actual code against the same data?

The dataset I'm using is a Wikipedia dump, which hardly counts as small :-) Having said that, the first couple of million lines is all you need to reproduce the results I'm getting, which you can download with: curl http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | bunzip2 | head -n 200 > enwiki-short.xml
Re: Parallelising over a lazy sequence - request for help
Thanks - when I said small, I was referring to the fact that your tests were using the first 10,000 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: The dataset I'm using is a Wikipedia dump, which hardly counts as small :-)
Re: Parallelising over a lazy sequence - request for help
On the other hand, it is 2013, not 2003. 40GB is small in terms of modern hardware. Terabyte-RAM servers have been available for a while, at prices within the reach of many projects. Large data in this decade is measured in petabytes, at least. On Sunday, September 29, 2013 5:13:14 PM UTC-7, Paul Mooser wrote: Thanks - when I said small, I was referring to the fact that your tests were using the first 10,000 pages, as opposed to the entire data dump. Sorry if I was unclear or misunderstood. On Sunday, September 29, 2013 3:20:38 PM UTC-7, Paul Butcher wrote: The dataset I'm using is a Wikipedia dump, which hardly counts as small :-)
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote: I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf.

Hey Stuart, Thanks for getting back to me. I've updated the code for my word count example on GitHub with (I believe) something that works along the lines you suggest: https://github.com/paulbutcher/parallel-word-count

Here are some sample runs on my machine (a 4-core Retina MacBook Pro). Each of these runs counts the words in the first 10,000 pages of Wikipedia:

$ lein run 1 ~/enwiki.xml sequential
Elapsed time: 23630.443 msecs
$ lein run 1 ~/enwiki.xml fold
Elapsed time: 8697.79 msecs
$ lein run 1 ~/enwiki.xml pmap 1
Elapsed time: 27393.703 msecs
$ lein run 1 ~/enwiki.xml pthenf 1
Elapsed time: 37263.193 msecs

As you can see, the foldable-seq version gives an almost 3x speedup relative to the sequential version, and both the partition-then-pmap and partition-then-fold versions are significantly slower. The last argument for the partition-then-pmap and partition-then-fold versions is a partition size. I've tried various different sizes with no material effect:

$ lein run 1 ~/enwiki.xml pthenf 1000
Elapsed time: 43187.385 msecs
$ lein run 1 ~/enwiki.xml pthenf 10
Elapsed time: 35702.651 msecs
$ lein run 1 ~/enwiki.xml pmap 1000
Elapsed time: 34087.314 msecs
$ lein run 1 ~/enwiki.xml pmap 10
Elapsed time: 47340.969 msecs

The performance of the partition-then-pmap version is actually much worse than the numbers above suggest. There's something very weird going on with (I guess) garbage collection - it takes a *long* time to finish after printing the elapsed time and the performance is completely pathological with larger page counts.
Bottom line: the only way I've been able to obtain any kind of speedup remains foldable-seq. I'd be very grateful indeed if you could take a look at how I've implemented partition-then-fold to make sure that I've correctly captured your intent. Or if you have any suggestions for anything else that might work, or to explain the poor performance of partition-then-pmap and partition-then-fold.

My guess is that the problem with partition-then-fold is the copying that's going on during the (into [] %). I can see that it is operating in parallel because the number of cores in use goes up, but the net effect is an overall slowdown rather than a speedup. That it performs worse than foldable-seq isn't surprising to me, given that it introduces an unnecessary copy.

I still think that it's a crying shame to disallow folding over sequences - as the above shows, the gains both in performance and programming ease are significant, and it would only take a small modification to the reducers API to fix the holding-onto-head problem. What would be the downside of making this modification and allowing foldable sequences?

On 28 Sep 2013, at 00:27, Stuart Halloway stuart.hallo...@gmail.com wrote: Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Enjoying your book!
Stu On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure. The code I'm working with is checked into GitHub: The basic sequential algorithm is: (frequencies (mapcat get-words pages)) If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on my MacBook Pro. One way to parallelise it is to create a parallel version of frequencies that uses reducers: (defn frequencies-fold [words] (r/fold (partial merge-with +) (fn
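The quoted message breaks off mid-definition of frequencies-fold. A plausible completion, consistent with the (partial merge-with +) combining function shown and with the sequential (frequencies (mapcat get-words pages)) version above, might look like this (the body of the reducing function is an assumption, not Paul's actual code):

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical completion of the truncated frequencies-fold: each
;; fork/join leaf builds a frequency map word by word, and leaves are
;; merged with (merge-with +). `words` must be a foldable collection
;; (e.g. a vector) for the fold to run in parallel.
(defn frequencies-fold [words]
  (r/fold (partial merge-with +)
          (fn [counts word]
            (assoc counts word (inc (get counts word 0))))
          words))

(frequencies-fold ["to" "be" "or" "not" "to" "be"])
;; => a map equal to {"to" 2, "be" 2, "or" 1, "not" 1}
```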
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 01:22, Rich Morin r...@cfcl.com wrote: On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ... Ordered; PDF just arrived (:-).

Cool - very interested to hear your feedback once you've had a chance to read it.

I don't know yet whether the book has anything like this, but I'd like to see a table that shows which concurrency and parallelism approaches are supported (and to what extent) by various languages. So, actors, agents, queues, reducers (etc.) would be on one axis; Clojure, Erlang, Go, and Ruby (etc.) would be on the other. I'd expect Clojure to have pretty complete coverage; Ruby, not so much.

That might make a good appendix or section in the (as yet unwritten) wrap-up chapter - thanks for the suggestion. And yes, I suspect that you're right that Clojure will fare well in the comparison :-)

-r -- http://www.cfcl.com/rdm Rich Morin http://www.cfcl.com/rdm/resume r...@cfcl.com http://www.cfcl.com/rdm/weblog +1 650-873-7841 Software system design, development, and documentation
Re: Parallelising over a lazy sequence - request for help
I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go away. http://clojuredocs.org/clojure_core/clojure.core/future

Andy

Sent from my iPhone
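Andy's (shutdown-agents) tip can be sketched in a few lines; the -main wrapper and the toy pmap workload are illustrative, not from the thread:

```clojure
;; pmap runs its tasks on future threads; those pooled threads keep the
;; JVM alive for up to ~60s after -main returns. Calling shutdown-agents
;; on the way out lets the process exit promptly.
(defn -main [& args]
  (try
    (println (reduce + (pmap inc (range 100))))   ; toy parallel workload
    (finally
      (shutdown-agents))))                        ; allow prompt JVM exit
```

Running (-main) prints 5050 and the JVM exits immediately instead of idling for the agent pool's keep-alive timeout.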
Re: Parallelising over a lazy sequence - request for help
I would go a bit more further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding into the head is being worked on right now, see http://dev.clojure.org/jira/browse/CLJ-1250 JW On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote: On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.comjavascript: wrote: I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Hey Stuart, Thanks for getting back to me. I've updated the code for my word count example on GitHub with (I believe) something that works along the lines you suggest: https://github.com/paulbutcher/parallel-word-count Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each of these runs counts the words in the first 1 pages of Wikipedia: $ lein run 1 ~/enwiki.xml sequential Elapsed time: 23630.443 msecs $ lein run 1 ~/enwiki.xml fold Elapsed time: 8697.79 msecs $ lein run 1 ~/enwiki.xml pmap 1 Elapsed time: 27393.703 msecs $ lein run 1 ~/enwiki.xml pthenf 1 Elapsed time: 37263.193 msecs As you can see, the the foldable-seq version gives an almost 3x speedup relative to the sequential version, and both the partition-then-pmap and partition-then-fold versions are significantly slower. The last argument for the partition-then-pmap and partition-then-fold versions is a partition size. 
I've tried various different sizes with no material effect: $ lein run 1 ~/enwiki.xml pthenf 1000 Elapsed time: 43187.385 msecs $ lein run 1 ~/enwiki.xml pthenf 10 Elapsed time: 35702.651 msecs $ lein run 1 ~/enwiki.xml pmap 1000 Elapsed time: 34087.314 msecs $ lein run 1 ~/enwiki.xml pmap 10 Elapsed time: 47340.969 msecs The performance of the partition-then-pmap version is actually much worse than the numbers above suggest. There's something very weird going on with (I guess) garbage collection - it takes a *long* time to finish after printing the elapsed time and the performance is completely pathological with larger page counts. Bottom line: the only way I've been able to obtain any kind of speedup remains foldable-seq. I'd be very grateful indeed if you could take a look at how I've implemented partition-then-fold to make sure that I've correctly captured your intent. Or if you have any suggestions for anything else that might work, or to explain the poor performance of partition-then-pmap and partition-then-fold. My guess is that the problem with partition-then-fold is the copying that's going on during the (into [] %). I can see that it is operating in parallel because the number of cores in use goes up, but the net effect is an overall slowdown rather than a speedup. That it performs worse than foldable-seq isn't surprising to me, given that it introduces an unnecessary copy. I still think that it's a crying shame to disallow folding over sequences - as the above shows, the gains both in performance and programming ease are significant, and it would only take a small modification to the reducers API to fix the holding-onto-head problem. What would be the downside of making this modification and allowing foldable sequences? -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? 
http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: pa...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher On 28 Sep 2013, at 00:27, Stuart Halloway stuart@gmail.com wrote: Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Enjoying your book! Stu On Sat, May 25, 2013 at 12:34 PM, Paul Butcher pa...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here. The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java
Re: Parallelising over a lazy sequence - request for help
Ah - one mystery down. Thanks Andy! -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher On 28 Sep 2013, at 16:26, Andy Fingerhut andy.finger...@gmail.com wrote: I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light. See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go away. http://clojuredocs.org/clojure_core/clojure.core/future Andy Sent from my iPhone
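A minimal shape of what Andy describes (the workload here is a stand-in for illustration, not the actual word-count code):

```clojure
;; Sketch: pmap and future use an agent-backed thread pool, and the JVM
;; waits ~60 seconds for that pool's idle threads to expire before the
;; process can exit. Calling shutdown-agents once the work is done
;; removes that wait.
(defn -main [& args]
  (time
    (doall (pmap inc (range 1000))))  ; stand-in for the real workload
  (shutdown-agents))
```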
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 17:14, Jozef Wagner jozef.wag...@gmail.com wrote: I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The issue with holding onto the head is being worked on right now; see http://dev.clojure.org/jira/browse/CLJ-1250 That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time. -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher -- -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
I mean that you should forget about lazy sequences, and sequences in general, if you want cutting-edge performance with reducers. An example of a reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto the head. JW On Sat, Sep 28, 2013 at 6:24 PM, Paul Butcher p...@paulbutcher.com wrote: That's fantastic news Jozef - is there any idea when that might be fixed? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time.
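Jozef's gist is linked above; as a rough illustration of the same idea (hypothetical code, not the gist's contents), a collection of lines can implement CollReduce directly over a reader, so no lazy sequence is built and no head is retained:

```clojure
(require '[clojure.core.protocols :as p]
         '[clojure.java.io :as io])

;; Hypothetical sketch: reduce directly over the lines of a file.
;; The reader is opened per reduction and closed when it finishes,
;; so files larger than memory can be processed.
(defn reducible-lines [file]
  (reify p/CollReduce
    (coll-reduce [this f]
      (p/coll-reduce this f (f)))
    (coll-reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader file)]
        (loop [acc init]
          (if (reduced? acc)
            @acc
            (if-let [line (.readLine rdr)]
              (recur (f acc line))
              acc)))))))

;; e.g. count the lines of a file without ever holding them all:
;; (reduce (fn [n _] (inc n)) 0 (reducible-lines "big.txt"))
```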
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 17:42, Jozef Wagner jozef.wag...@gmail.com wrote: I mean that you should forget about lazy sequences, and sequences in general, if you want cutting-edge performance with reducers. An example of a reducible slurp, https://gist.github.com/wagjo/6743885 , does not hold onto the head. OK - I buy that logic. But I'm unsure whether it can be applied to the case that I'm interested in. Firstly, I'm dealing with an XML file, so I need to parse it. Right now I'm using data.xml/parse to do so, which returns a lazy sequence. So, short of writing my own XML parser, I have to go via a lazy sequence at some point along the way? Secondly, the primary point of this is to exploit parallelism, not reducers per se. So supporting CollReduce isn't enough; I also need to support CollFold. Which is exactly what foldable-seq does - I'm not sure how that differs from what you're proposing? Am I missing something? -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
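For reference on the CollReduce/CollFold distinction discussed above: r/fold only parallelises collections that support CollFold (vectors and hash maps out of the box), and needs an associative combining function. A small word-count fold over an in-memory vector might look like this (illustrative only, not the foldable-seq code):

```clojure
(require '[clojure.core.reducers :as r])

;; Parallel word count over a vector. The reduce fn builds a per-chunk
;; count map; r/monoid gives fold a combinef whose zero-arity returns an
;; empty map and which merges two chunk results with merge-with +.
(defn fold-count [words]
  (r/fold (r/monoid (partial merge-with +) hash-map)
          (fn [counts w] (assoc counts w (inc (get counts w 0))))
          words))

;; (fold-count ["a" "b" "a"]) ;=> {"a" 2, "b" 1}
```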
Re: Parallelising over a lazy sequence - request for help
Well, it should be possible to implement a foldseq variant which takes a reducible collection as an input. This would speed things up, as you don't create so much garbage with reducers. An XML parser which produces a reducible collection will be a bit harder :). Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. JW
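A sketch of the transient-based counting Jozef suggests (hypothetical code, not the repo's actual function):

```clojure
;; Accumulate counts in a single transient map, mutated in place via
;; assoc!, and make it persistent exactly once at the end - instead of
;; allocating a fresh persistent map for every word seen.
(defn count-words [words]
  (persistent!
    (reduce (fn [counts w]
              (assoc! counts w (inc (get counts w 0))))
            (transient {})
            words)))

;; (= {"to" 2, "be" 2, "or" 1, "not" 1}
;;    (count-words ["to" "be" "or" "not" "to" "be"])) ;=> true
```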
Re: Parallelising over a lazy sequence - request for help
Or even better, use Guava's Multiset there... On Saturday, September 28, 2013 8:51:56 PM UTC+2, Jozef Wagner wrote: Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. JW
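The Multiset suggestion might look like this (illustrative only; assumes com.google.guava/guava is on the classpath):

```clojure
(import 'com.google.common.collect.HashMultiset)

;; A mutable counting container: .add bumps a word's count in place,
;; avoiding any per-word map allocation. Converted back to a Clojure
;; map at the end for the caller's convenience.
(defn count-words-guava [words]
  (let [ms (HashMultiset/create)]
    (doseq [^String w words]
      (.add ms w))
    (into {} (for [w (.elementSet ms)]
               [w (.count ms w)]))))
```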
Re: Parallelising over a lazy sequence - request for help
If a Clojure ticket is triaged, it means that one of the Clojure screeners believes the ticket describes a real issue with Clojure that ought to be changed in some way, and would like Rich Hickey to look at it and see whether he agrees. If he does, it becomes vetted. A diagram of the process that Clojure tickets go through is shown at the link below (you will have to scroll down a bit to see it): http://dev.clojure.org/display/community/JIRA+workflow If you are curious to know about changes in any particular ticket's state as they occur, create an account on JIRA [1] and click the Watch link when viewing that ticket. You will be emailed about changes in its state. The soonest that any ticket might become part of a modified version of Clojure released as a JAR file is when Clojure 1.6.0 is released. If forced to guess, I would put a small bet on that happening in the first half of 2014 some time, but I have no actual knowledge from those deciding when that will happen. In the past it seems to have been based more upon when certain features were added than on a particular amount of time passing. Until the release, there is no promise that any particular ticket will or will not become part of the release. [1] http://dev.clojure.org/jira/secure/Signup!default.jspa Andy On Sat, Sep 28, 2013 at 9:24 AM, Paul Butcher p...@paulbutcher.com wrote: That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process?
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 19:51, Jozef Wagner jozef.wag...@gmail.com wrote: Anyway, I think the bottleneck in your code is at https://github.com/paulbutcher/parallel-word-count/blob/master/src/wordcount/core.clj#L9 Instead of creating a new persistent map for each word, you should use a transient here. I would love it if that were true. However, the code you reference is from the algorithm that uses foldable-seq and runs quickly (i.e. not the one that has the performance problem). -- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
Re: Parallelising over a lazy sequence - request for help
For your timings, I would also strongly recommend altering your project.clj to force the -server HotSpot VM: :jvm-opts ^:replace ["-Xmx1g" "-server" ...and whatever else you want here...] By default lein will use tiered compilation to optimize REPL startup, which is not what you want for timing. And I second Andy's (shutdown-agents) advice.
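Concretely, the project.clj fragment Stu describes would look something like this (project name and memory setting are illustrative):

```clojure
(defproject wordcount "0.1.0-SNAPSHOT"
  ;; ^:replace stops Leiningen from appending its default flags
  ;; (-XX:+TieredCompilation -XX:TieredStopAtLevel=1), which favour
  ;; fast REPL startup over peak steady-state performance.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```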
Re: Parallelising over a lazy sequence - request for help
Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence, but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses fork/join over the top of those chunks. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. I would also mention that using transients as you build input collections will be a big win. Alex
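The pool-of-compute-threads approach Alex outlines could be sketched like this (hypothetical names; a bounded queue between the reader and the pool would add the backpressure he mentions, omitted here for brevity):

```clojure
(import '[java.util.concurrent Executors TimeUnit])

;; A fixed pool of compute threads consumes tasks drawn from the
;; (possibly lazy) input; results are reassembled at the end by
;; dereferencing the Futures in order.
(defn process-with-pool [n-threads process items]
  (let [pool (Executors/newFixedThreadPool n-threads)]
    (try
      (->> items
           (mapv (fn [item] (.submit pool ^Callable (fn [] (process item)))))
           (mapv deref))
      (finally
        (.shutdown pool)
        (.awaitTermination pool 1 TimeUnit/MINUTES)))))

;; e.g. (process-with-pool 4 frequencies [["a" "b"] ["a" "a"]])
```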
Re: Parallelising over a lazy sequence - request for help
I am hoping that this will be fixed for 1.6, but no one is actually working on it afaik. If someone wants to take it on, I would GREATLY appreciate a patch on this ticket (must be a contributor of course).

On Saturday, September 28, 2013 11:24:18 AM UTC-5, Paul Butcher wrote:

On 28 Sep 2013, at 17:14, Jozef Wagner jozef@gmail.com wrote: I would go a bit further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding onto the head is being worked on right now; see http://dev.clojure.org/jira/browse/CLJ-1250

That's fantastic news Jozef - is there any idea when that might be fixed? I can see that it's been triaged, but I'm not sure what exactly that means when it comes to the Clojure dev process? Could you expand on exactly what you mean when you say an input reader which returns a foldable collection? Bear in mind that the problem I'm trying to solve involves a 40GB Wikipedia dump, which clearly can't all be in RAM at one time.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: pa...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

-- -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups Clojure group. 
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Parallelising over a lazy sequence - request for help
Can't your last possible solution rather be implemented on top of an f/j pool? Is it possible to beat f/j pool performance with an ad-hoc thread pool in situations where there are thousands of tasks? JW

On Sat, Sep 28, 2013 at 11:00 PM, Alex Miller a...@puredanger.com wrote:

Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks. Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system. I would also mention that using transients as you build input collections will be a big win. Alex
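A rough sketch of the compute-pool-plus-buffered-queue shape Alex describes, using java.util.concurrent directly. All names here are illustrative, not code from the thread; the key idea is the *bounded* work queue, whose CallerRunsPolicy throttles the reader thread whenever the workers fall behind.

```clojure
(import '(java.util.concurrent ThreadPoolExecutor ThreadPoolExecutor$CallerRunsPolicy
                               ArrayBlockingQueue TimeUnit Callable))

(defn process-in-pool
  "Sketch: feed chunks of coll to a fixed pool through a bounded queue,
  apply f to each chunk on a pool thread, and merge the partial results
  with merge-fn. When the queue is full, the submitting (reader) thread
  runs the task itself, so the reader can't race ahead of the workers
  and exhaust memory."
  [nthreads queue-size chunk-size f merge-fn coll]
  (let [^ThreadPoolExecutor pool
        (ThreadPoolExecutor. nthreads nthreads 0 TimeUnit/MILLISECONDS
                             (ArrayBlockingQueue. queue-size)
                             (ThreadPoolExecutor$CallerRunsPolicy.))
        ;; doall drives submission; backpressure comes from the bounded queue
        futs (doall (map (fn [chunk] (.submit pool ^Callable (fn [] (f chunk))))
                         (partition-all chunk-size coll)))
        res  (reduce merge-fn (map #(.get %) futs))]
    (.shutdown pool)
    res))
```

Usage for the word-count case would be something like (process-in-pool 4 8 10000 frequencies (partial merge-with +) words).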
Re: Parallelising over a lazy sequence - request for help
Thanks Alex - I've made both of these changes. The shutdown-agents did get rid of the pause at the end of the pmap solution, and the -server argument made a very slight across-the-board performance improvement. But neither of them fundamentally changes the basic result (that the implementation that uses foldable-seq gives a 3x speedup and the other implementations are slower than the sequential version).

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher

On 28 Sep 2013, at 21:47, Alex Miller a...@puredanger.com wrote:

For your timings, I would also strongly recommend altering your project.clj to force the -server hotspot: :jvm-opts ^:replace [-Xmx1g -server ... and whatever else you want here ... ] By default lein will use tiered compilation to optimize repl startup, which is not what you want for timing. And I second Andy's (shutdown-agents) advice.
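For reference, a minimal project.clj sketch of the change Alex describes. The project name and dependency version are illustrative, not taken from Paul's actual project, and the JVM options are quoted as strings as Leiningen expects:

```clojure
(defproject parallel-word-count "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  ;; ^:replace stops lein merging in its default tiered-compilation
  ;; options, so timed runs use the full -server JIT.
  :jvm-opts ^:replace ["-Xmx1g" "-server"])
```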
Re: Parallelising over a lazy sequence - request for help
On 28 Sep 2013, at 22:00, Alex Miller a...@puredanger.com wrote:

Reducers (and fork/join in general) are best suited for fine-grained computational parallelism on in-memory data. The problem in question involves processing more data than will fit in memory. So the question is then what is the best way to parallelize computation over the stream. There are many ways to do so - pmap is one easy mechanism to get parallelism over a lazy sequence but the chunking behavior of pmap can easily be non-optimal for particular use cases. Stu's other solution (prepare-with-partition-then-fold) reads chunks of data into memory, then uses f/j over the top of those chunks.

Yes - I'm aware that that's how Stu's solution works. However, when I use that in my example, the performance sucks. As I said in my earlier message, I believe that the reason for this is the additional copying that it involves. To date, I have exactly one implementation that has good parallel performance - the one that uses foldable-seq (which is 3x faster than the sequential version). Everything else I've tried performs worse than the sequential version.

Yet another possible solution is to use a pool of compute threads and to read from the stream, give those compute tasks to the pool to execute in a background thread, then reassemble the results at the end (or in a separate thread). The balance in this approach is to make the tasks the right chunkiness. Using a buffered queue is one technique to avoid having your input reader overload the capacity of the processing system.

That's basically *exactly* what the foldable-seq implementation does.

I would also mention that using transients as you build input collections will be a big win.

I'm sure that that's true - but the foldable-seq implementation is the one that would gain most from that, and that's the one that already runs 3x faster than the sequential implementation. But it's also the one that Stu objects to. What I'm trying to find is an implementation that Stu doesn't object to, but is faster than the sequential version of the algorithm.

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
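To make the transients point from this exchange concrete, here is a hedged sketch of a transient-backed word counter. The name is illustrative, and this is essentially what clojure.core/frequencies already does internally; within a single thread the transient map can be mutated in place and converted to a persistent map only once, at the end.

```clojure
(defn frequencies-transient
  "Sketch: accumulate counts into a transient map (cheap in-place
  updates, single-threaded), then freeze it with persistent!."
  [words]
  (persistent!
   (reduce (fn [counts word]
             (assoc! counts word (inc (get counts word 0))))
           (transient {})
           words)))
```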
Re: Parallelising over a lazy sequence - request for help
Hi Paul,

I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj . I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf.

Enjoying your book!

Stu
Re: Parallelising over a lazy sequence - request for help
On Sat, May 25, 2013 at 12:34 PM, Paul Butcher p...@paulbutcher.com wrote: I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. ...

Ordered; PDF just arrived (:-). I don't know yet whether the book has anything like this, but I'd like to see a table that shows which concurrency and parallelism approaches are supported (and to what extent) by various languages. So, actors, agents, queues, reducers (etc.) would be on one axis; Clojure, Erlang, Go, and Ruby (etc.) would be on the other. I'd expect Clojure to have pretty complete coverage; Ruby, not so much.

-r

-- Rich Morin http://www.cfcl.com/rdm http://www.cfcl.com/rdm/resume r...@cfcl.com http://www.cfcl.com/rdm/weblog +1 650-873-7841 Software system design, development, and documentation
Parallelising over a lazy sequence - request for help
I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that I can get some help to work through it here.

The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been incapable of coming up with a nice solution in Clojure. The code I'm working with is checked into GitHub:

The basic sequential algorithm is:

(frequencies (mapcat get-words pages))

If I run that on the first 10k pages in the Wikipedia dump, it takes ~21s on my MacBook Pro. One way to parallelise it is to create a parallel version of frequencies that uses reducers:

(defn frequencies-fold [words]
  (r/fold (partial merge-with +)
          (fn [counts word]
            (assoc counts word (inc (get counts word 0))))
          words))

And sure enough, if I use that, along with the foldable-seq utility I posted about here a while ago, it runs in ~8s - almost a 3x speedup, not bad given that the parallel version is unable to use transients.

Unfortunately, as they currently stand, reducers have a fatal flaw that means that, even with foldable-seq, they're basically useless with lazy sequences. Reducers always hold onto the head of the sequence they're given, so there's no way to use this approach for a complete Wikipedia dump (which runs to around 40GiB). So the other approach I've tried is to use pmap:

(defn frequencies-pmap [words]
  (reduce (partial merge-with +)
          (pmap frequencies
                (partition-all 1 words))))

But, for reasons I don't understand, this performs dreadfully - taking ~26s, i.e. significantly slower than the sequential version. I've tried playing with different partition sizes without materially affecting the result.
So, what I'm looking for is either:

a) An explanation for why the pmap-based solution performs so badly, or
b) A way to fix the holding-onto-head problem that's inherent within reducers.

With the last of these in mind, it strikes me that the problem fundamentally arises from the desire for reducers to follow the same basic API as normal code. So:

(reduce (filter ... (map ... coll)))

becomes:

(r/fold (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce and/or fold - would avoid the problem:

(r/fold (r/filter ... (r/map ...)) coll)

Anyway - I'd be very grateful for any insight into either of the above questions. Or for suggestions for an alternative approach that might be more fruitful.

Many thanks in advance,

-- paul.butcher-msgCount++ Snetterton, Castle Combe, Cadwell Park... Who says I have a one track mind? http://www.paulbutcher.com/ LinkedIn: http://www.linkedin.com/in/paulbutcher MSN: p...@paulbutcher.com AIM: paulrabutcher Skype: paulrabutcher
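To make the proposed call shape concrete, here is a hypothetical helper - none of this exists in clojure.core.reducers; it only illustrates supplying the collection at the point of the fold, with the transformation pipeline built beforehand without reference to any particular collection:

```clojure
(require '[clojure.core.reducers :as r])

(defn fold-with
  "Hypothetical: xform maps a collection to a reducible/foldable.
  Because the pipeline is constructed without a collection, nothing
  built ahead of time holds the head; the collection is supplied last."
  [combinef reducef xform coll]
  (r/fold combinef reducef (xform coll)))

;; e.g. (fold-with (partial merge-with +) count-word
;;                 #(r/mapcat get-words %) pages)
```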