On 28 Sep 2013, at 22:00, Alex Miller <a...@puredanger.com> wrote:

> Reducers (and fork/join in general) are best suited for fine-grained 
> computational parallelism on in-memory data. The problem in question involves 
> processing more data than will fit in memory.
> 
> So the question is then what is the best way to parallelize computation over 
> the stream. There are many ways to do so - pmap is one easy mechanism to get 
> parallelism over a lazy sequence but the chunking behavior of pmap can easily 
> be non-optimal for particular use cases. Stu's other solution 
> (prepare-with-partition-then-fold) reads chunks of data into memory, then 
> uses f/j over the top of those chunks. 

Yes - I'm aware that's how Stu's solution works. However, when I use it in my 
example, the performance is dreadful. As I said in my earlier message, I 
believe the reason for this is the additional copying it involves.
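
For concreteness, here's roughly how I understand partition-then-fold to work 
(a sketch of my reading of it, not Stu's actual code, reusing the reducing 
function from my word-count example); the (into [] chunk) step is the copy I'm 
blaming:

```clojure
(require '[clojure.core.reducers :as r])

(defn frequencies-pthenf
  "Sketch: realise each chunk of the (lazy) word seq into a vector so that
  it's foldable, fold each chunk with fork/join, then merge the per-chunk
  counts. Note the extra copy made by (into [] chunk)."
  [chunk-size words]
  (reduce (partial merge-with +)
          {}
          (map (fn [chunk]
                 (r/fold (partial merge-with +)
                         (fn [counts word]
                           (assoc counts word (inc (get counts word 0))))
                         (into [] chunk)))
               (partition-all chunk-size words))))
```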

To date, I have exactly one implementation that has good parallel performance - 
the one that uses foldable-seq (which is 3x faster than the sequential 
version). Everything else I've tried performs worse than the sequential version.

> Yet another possible solution is to use a pool of compute threads and to read 
> from the stream, give those compute tasks to the pool to execute in a 
> background thread, then reassemble the results at the end (or in a separate 
> thread). The balance in this approach is to make the tasks the right 
> "chunkiness". Using a buffered queue is one technique to avoid having your 
> input reader overload the capacity of the processing system.

That's basically *exactly* what the foldable-seq implementation does.
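
For comparison, here's a minimal sketch of that buffered-queue idea (my own 
illustration, not the foldable-seq code): a fixed-size pool with a bounded 
work queue, where CallerRunsPolicy makes the reading thread process a chunk 
itself whenever the workers fall behind:

```clojure
(import '[java.util.concurrent ThreadPoolExecutor
          ThreadPoolExecutor$CallerRunsPolicy
          ArrayBlockingQueue Future TimeUnit])

(defn pooled-frequencies
  "Sketch: hand fixed-size chunks of a (possibly huge, lazy) word seq to a
  fixed thread pool and merge the per-chunk counts at the end. The bounded
  queue provides back-pressure so the reader can't outrun the workers."
  [chunk-size words]
  (let [n (.availableProcessors (Runtime/getRuntime))
        ^ThreadPoolExecutor pool
        (ThreadPoolExecutor. n n 0 TimeUnit/MILLISECONDS
                             (ArrayBlockingQueue. (* 2 n))
                             (ThreadPoolExecutor$CallerRunsPolicy.))
        ;; submit one counting task per chunk; submission blocks (via
        ;; caller-runs) when the work queue is full
        futs (doall (map (fn [chunk]
                           (.submit pool ^Callable (fn [] (frequencies chunk))))
                         (partition-all chunk-size words)))
        ;; reassemble: merge the per-chunk count maps
        counts (reduce (partial merge-with +) {}
                       (map (fn [^Future f] (.get f)) futs))]
    (.shutdown pool)
    counts))
```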

> I would also mention that using transients as you build input collections 
> will be a big win. 

I'm sure that's true - but the foldable-seq implementation is the one that 
would gain most from that, and that's the one that already runs 3x faster than 
the sequential implementation. It's also the one that Stu objects to. What 
I'm trying to find is an implementation that Stu doesn't object to but that is 
still faster than the sequential version of the algorithm.
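
For reference, the transient version of the sequential reducing step would 
look something like this sketch:

```clojure
(defn frequencies-transient
  "Sequential word count that accumulates into a transient map, calling
  persistent! once at the end. Behaves like clojure.core/frequencies."
  [words]
  (persistent!
   (reduce (fn [counts word]
             (assoc! counts word (inc (get counts word 0))))
           (transient {})
           words)))
```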

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 22:00, Alex Miller <a...@puredanger.com> wrote:

> On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:
> I would go a bit further and suggest that you do not use sequences at 
> all and work only with reducible/foldable collections. Make an input reader 
> which returns a foldable collection and you will have the most performant 
> solution. The issue with holding onto the head is being worked on right now; 
> see http://dev.clojure.org/jira/browse/CLJ-1250
> 
> JW
> 
> On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
> 
>> I have posted an example that shows partition-then-fold at 
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>> 
>> I would be curious to know how this approach performs with your data.  With 
>> the generated data I used, the partition+fold and partition+pmap approaches 
>> both used most of my cores and had similar perf.
> 
> Hey Stuart, 
> 
> Thanks for getting back to me.
> 
> I've updated the code for my word count example on GitHub with (I believe) 
> something that works along the lines you suggest:
> 
> https://github.com/paulbutcher/parallel-word-count
> 
> Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each 
> of these runs counts the words in the first 10000 pages of Wikipedia:
> 
>> $ lein run 10000 ~/enwiki.xml sequential
>> "Elapsed time: 23630.443 msecs"
>> $ lein run 10000 ~/enwiki.xml fold
>> "Elapsed time: 8697.79 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 10000
>> "Elapsed time: 27393.703 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 10000
>> "Elapsed time: 37263.193 msecs"
> 
> As you can see, the foldable-seq version gives an almost 3x speedup 
> relative to the sequential version, while both the partition-then-pmap and 
> partition-then-fold versions are significantly slower.
> 
> The last argument to the partition-then-pmap and partition-then-fold 
> versions is a partition size. I've tried several different sizes with no 
> material effect:
> 
>> $ lein run 10000 ~/enwiki.xml pthenf 1000
>> "Elapsed time: 43187.385 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 100000
>> "Elapsed time: 35702.651 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 1000
>> "Elapsed time: 34087.314 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 100000
>> "Elapsed time: 47340.969 msecs"
> 
> The performance of the partition-then-pmap version is actually much worse 
> than the numbers above suggest. There's something very weird going on with (I 
> guess) garbage collection: it takes a *long* time to finish after printing 
> the elapsed time, and the performance becomes completely pathological with 
> larger page counts.
> 
> Bottom line: the only way I've been able to obtain any kind of speedup 
> remains foldable-seq.
> 
> I'd be very grateful indeed if you could take a look at how I've implemented 
> partition-then-fold to make sure that I've correctly captured your intent, if 
> you have any suggestions for anything else that might work, or if you can 
> explain the poor performance of partition-then-pmap and partition-then-fold.
> 
> My guess is that the problem with partition-then-fold is the copying that's 
> going on during the (into [] %). I can see that it is operating in parallel 
> because the number of cores in use goes up, but the net effect is an overall 
> slowdown rather than a speedup.
> 
> That it performs worse than foldable-seq isn't surprising to me, given that 
> it introduces an unnecessary copy.
> 
> I still think that it's a crying shame to disallow folding over sequences - 
> as the above shows, the gains both in performance and programming ease are 
> significant, and it would only take a small modification to the reducers API 
> to fix the holding-onto-head problem. What would be the downside of making 
> this modification and allowing foldable sequences?
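
To make the shape I'm proposing concrete, here's a hypothetical sketch 
(fold-with is an invented name, not part of the reducers API):

```clojure
(require '[clojure.core.reducers :as r])

(defn fold-with
  "Hypothetical sketch: take the reducer pipeline as a *function* of the
  collection (e.g. #(r/map f (r/filter p %))) and supply the collection
  only at fold time, so no intermediate reducer object captures its head
  while the pipeline is being composed."
  [combinef reducef pipeline coll]
  (r/fold combinef reducef (pipeline coll)))
```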
> 
> 
> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
> 
>> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <pa...@paulbutcher.com> wrote:
>> I'm currently working on a book on concurrent/parallel development for The 
>> Pragmatic Programmers. One of the subjects I'm covering is parallel 
>> programming in Clojure, but I've hit a roadblock with one of the examples. 
>> I'm hoping that I can get some help to work through it here.
>> 
>> The example counts the words contained within a Wikipedia dump. It should 
>> respond well to parallelisation (I have Java and Scala versions that perform 
>> excellently), but I've been unable to come up with a nice solution in 
>> Clojure.
>> 
>> The code I'm working with is checked into GitHub: 
>> 
>> The basic sequential algorithm is:
>> 
>>> (frequencies (mapcat get-words pages))
>> 
>> 
>> If I run that on the first 10k pages of a Wikipedia dump, it takes ~21s on 
>> my MacBook Pro.
>> 
>> One way to parallelise it is to create a parallel version of frequencies 
>> that uses reducers:
>> 
>>> (defn frequencies-fold [words]
>>>   (r/fold (partial merge-with +)
>>>           (fn [counts word] (assoc counts word (inc (get counts word 0))))
>>>           words))
>> 
>> And sure enough, if I use that, along with the foldable-seq utility I 
>> posted about here a while ago, it runs in ~8s - almost a 3x speedup, not bad 
>> given that the parallel version is unable to use transients.
>> 
>> Unfortunately, as they currently stand, reducers have a fatal flaw that 
>> makes them basically useless with lazy sequences, even with foldable-seq: 
>> they always hold onto the head of the sequence they're given, so there's no 
>> way to use this approach for a complete Wikipedia dump (which runs to 
>> around 40GiB).
>> 
>> So the other approach I've tried is to use pmap:
>> 
>>> (defn frequencies-pmap [words]
>>>   (reduce (partial merge-with +)
>>>           (pmap frequencies
>>>                 (partition-all 10000 words))))
>> 
>> But, for reasons I don't understand, this performs dreadfully - taking ~26s, 
>> i.e. significantly slower than the sequential version.
>> 
>> I've tried playing with different partition sizes without materially 
>> affecting the result.
>> 
>> So, what I'm looking for is either:
>> 
>> a) an explanation of why the pmap-based solution performs so badly, or
>> 
>> b) a way to fix the "holding onto head" problem that's inherent in 
>> reducers.
>> 
>> With the last of these in mind, it strikes me that the problem fundamentally 
>> arises from the desire for reducers to follow the same basic API as "normal" 
>> code. So:
>> 
>> (reduce (filter ... (map ... coll)))
>> 
>> becomes:
>> 
>> (r/fold (r/filter ... (r/map ... coll)))
>> 
>> A very small change to the reducers API - passing the collection to the 
>> reduce and/or fold - would avoid the problem:
>> 
>> (r/fold (r/filter ... (r/map ...)) coll)
>> 
>> Anyway - I'd be very grateful for any insight into either of the above 
>> questions. Or for suggestions for an alternative approach that might be more 
>> fruitful.
>> 
>> Many thanks in advance,
>> 
>> 
>> 
>> -- 
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clo...@googlegroups.com
>> Note that posts from new members are moderated - please be patient with your 
>> first post.
>> To unsubscribe from this group, send email to
>> clojure+u...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to clojure+u...@googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
