Re: Parallelising over a lazy sequence - request for help

Stuart Halloway Fri, 27 Sep 2013 16:28:25 -0700

Hi Paul,

I have posted an example that shows partition-then-fold at
https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
.


I would be curious to know how this approach performs with your data.  With
the generated data I used, the partition+fold and partition+pmap approaches
both used most of my cores and had similar perf.

Enjoying your book!

Stu


On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <p...@paulbutcher.com> wrote:

> I'm currently working on a book on concurrent/parallel development for The
> Pragmatic Programmers. One of the subjects I'm covering is parallel
> programming in Clojure, but I've hit a roadblock with one of the examples.
> I'm hoping that I can get some help to work through it here.
>
> The example counts the words contained within a Wikipedia dump. It should
> respond well to parallelisation (I have Java and Scala versions that
> perform excellently) but I've been incapable of coming up with a nice
> solution in Clojure.
>
> The code I'm working with is checked into 
> GitHub<https://github.com/paulbutcher/parallel-word-count>
> :
>
> The basic sequential algorithm is:
>
> (frequencies (mapcat get-words pages))
>
>
> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on
> my MacBook Pro.
>
> One way to parallelise it is to create a parallel version of frequencies
> that uses reducers:
>
> (defn frequencies-fold [words]
>   (r/fold (partial merge-with +)
>           (fn [counts word] (assoc counts word (inc (get counts word 0))))
>
>           words))
>
>
> And sure enough, if I use that, along with use the 
> foldable-seq<https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj>
>  utility
> I posted about 
> here<https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J> are
> while ago it runs in ~8s, almost a 3x speedup, not bad given that the
> parallel version is unable to use transients.
>
> Unfortunately, as they currently stand, reducers have a fatal flaw that
> means that, even with foldable-seq, they're basically useless with lazy
> sequences. Reducers always hold onto the 
> head<https://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ> of
> the sequence they're given, so there's no way to use this approach for a
> complete Wikipedia dump (which runs to around 40GiB).
>
> So the other approach I've tried is to use pmap:
>
> (defn frequencies-pmap [words]
>   (reduce (partial merge-with +)
>     (pmap frequencies
>       (partition-all 10000 words))))
>
>
> But, for reasons I don't understand, this performs dreadfully - taking
> ~26s, i.e. significantly slower than the sequential version.
>
> I've tried playing with different partition sizes without materially
> affecting the result.
>
> So, what I'm looking for is either:
>
> a) An explanation for why the pmap-based solution performs so badly
>
> b) A way to fix the "holding onto head" problem that's inherent within
> reducers.
>
> With the last of these in mind, it strikes me that the problem
> fundamentally arises from the desire for reducers to follow the same basic
> API as "normal" code. So:
>
> (reduce (filter ... (map ... coll)))
>
> becomes:
>
> (r/fold (r/filter ... (r/map ... coll)))
>
> A very small change to the reducers API - passing the collection to the
> reduce and/or fold - would avoid the problem:
>
> (r/fold (r/filter ... (r/map ...)) coll)
>
> Anyway - I'd be very grateful for any insight into either of the above
> questions. Or for suggestions for an alternative approach that might be
> more fruitful.
>
> Many thanks in advance,
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: Parallelising over a lazy sequence - request for help

Reply via email to