Hi Paul, I have posted an example that shows partition-then-fold at https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj .
I would be curious to know how this approach performs with your data. With the generated data I used, the partition+fold and partition+pmap approaches both used most of my cores and had similar perf. Enjoying your book! Stu On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <p...@paulbutcher.com> wrote: > I'm currently working on a book on concurrent/parallel development for The > Pragmatic Programmers. One of the subjects I'm covering is parallel > programming in Clojure, but I've hit a roadblock with one of the examples. > I'm hoping that I can get some help to work through it here. > > The example counts the words contained within a Wikipedia dump. It should > respond well to parallelisation (I have Java and Scala versions that > perform excellently) but I've been incapable of coming up with a nice > solution in Clojure. > > The code I'm working with is checked into > GitHub<https://github.com/paulbutcher/parallel-word-count> > : > > The basic sequential algorithm is: > > (frequencies (mapcat get-words pages)) > > > If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on > my MacBook Pro. > > One way to parallelise it is to create a parallel version of frequencies > that uses reducers: > > (defn frequencies-fold [words] > (r/fold (partial merge-with +) > (fn [counts word] (assoc counts word (inc (get counts word 0)))) > > words)) > > > And sure enough, if I use that, along with use the > foldable-seq<https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj> > utility > I posted about > here<https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J> are > while ago it runs in ~8s, almost a 3x speedup, not bad given that the > parallel version is unable to use transients. > > Unfortunately, as they currently stand, reducers have a fatal flaw that > means that, even with foldable-seq, they're basically useless with lazy > sequences. Reducers always hold onto the > head<https://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ> of > the sequence they're given, so there's no way to use this approach for a > complete Wikipedia dump (which runs to around 40GiB). > > So the other approach I've tried is to use pmap: > > (defn frequencies-pmap [words] > (reduce (partial merge-with +) > (pmap frequencies > (partition-all 10000 words)))) > > > But, for reasons I don't understand, this performs dreadfully - taking > ~26s, i.e. significantly slower than the sequential version. > > I've tried playing with different partition sizes without materially > affecting the result. > > So, what I'm looking for is either: > > a) An explanation for why the pmap-based solution performs so badly > > b) A way to fix the "holding onto head" problem that's inherent within > reducers. > > With the last of these in mind, it strikes me that the problem > fundamentally arises from the desire for reducers to follow the same basic > API as "normal" code. So: > > (reduce (filter ... (map ... coll))) > > becomes: > > (r/fold (r/filter ... (r/map ... coll))) > > A very small change to the reducers API - passing the collection to the > reduce and/or fold - would avoid the problem: > > (r/fold (r/filter ... (r/map ...)) coll) > > Anyway - I'd be very grateful for any insight into either of the above > questions. Or for suggestions for an alternative approach that might be > more fruitful. > > Many thanks in advance, > > -- > paul.butcher->msgCount++ > > Snetterton, Castle Combe, Cadwell Park... > Who says I have a one track mind? > > http://www.paulbutcher.com/ > LinkedIn: http://www.linkedin.com/in/paulbutcher > MSN: p...@paulbutcher.com > AIM: paulrabutcher > Skype: paulrabutcher > > -- > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > --- > You received this message because you are subscribed to the Google Groups > "Clojure" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to clojure+unsubscr...@googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > > > -- -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.