I'm currently working on a book on concurrent/parallel development for The 
Pragmatic Programmers. One of the subjects I'm covering is parallel programming 
in Clojure, but I've hit a roadblock with one of the examples. I'm hoping that 
I can get some help to work through it here.

The example counts the words contained within a Wikipedia dump. It should 
respond well to parallelisation (I have Java and Scala versions that perform 
excellently), but I haven't been able to come up with a nice solution in 
Clojure.

The code I'm working with is checked into GitHub: 

The basic sequential algorithm is:

> (frequencies (mapcat get-words pages))


If I run that on the first 10k pages of a Wikipedia dump, it takes ~21s on my 
MacBook Pro.
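(If you want to reproduce the numbers without pulling the full repository: 
get-words in my code is essentially a regex scan. A minimal stand-in, not the 
exact implementation, would be something like:)

```clojure
;; Minimal stand-in for get-words: a case-insensitive scan for runs
;; of word characters. The real version lives in the repository.
(defn get-words [text]
  (re-seq #"\w+" (.toLowerCase ^String text)))
```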

One way to parallelise it is to create a parallel version of frequencies that 
uses reducers:

> (defn frequencies-fold [words]
>   (r/fold (partial merge-with +)
>           (fn [counts word] (assoc counts word (inc (get counts word 0))))
>           words))

And sure enough, if I use that, along with the foldable-seq utility I 
posted about here a while ago, it runs in ~8s, almost a 3x speedup. Not bad, 
given that the parallel version is unable to use transients.
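(Incidentally, (partial merge-with +) only works as a combinef here because 
(merge-with +) called with no maps happens to return nil, and assoc on nil 
builds a fresh map. A sketch of the same fold with the partition identity 
spelled out explicitly:)

```clojure
(require '[clojure.core.reducers :as r])

;; Same fold as frequencies-fold above, but with the per-partition
;; identity ({}) made explicit in the combinef rather than relying
;; on (merge-with +) returning nil.
(defn frequencies-fold [words]
  (r/fold (fn ([] {}) ([a b] (merge-with + a b)))
          (fn [counts word] (assoc counts word (inc (get counts word 0))))
          words))
```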

Unfortunately, as they currently stand, reducers have a fatal flaw that means 
that, even with foldable-seq, they're basically useless with lazy sequences. 
Reducers always hold onto the head of the sequence they're given, so there's no 
way to use this approach for a complete Wikipedia dump (which runs to around 
40GiB).

So the other approach I've tried is to use pmap:

> (defn frequencies-pmap [words]
>   (reduce (partial merge-with +) 
>     (pmap frequencies 
>       (partition-all 10000 words))))

But, for reasons I don't understand, this performs dreadfully, taking ~26s, 
i.e. significantly slower than the sequential version.

I've tried playing with different partition sizes without materially affecting 
the result.
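(One variation I can imagine trying, though I haven't benchmarked it on the 
full dump, is to force each chunk on the consuming thread before pmap hands 
it to a worker, so the workers spend their time purely on counting rather 
than on walking the lazy input:)

```clojure
;; Variation on frequencies-pmap: realise each chunk eagerly (doall)
;; on the consuming thread, so that the pmap worker threads only do
;; the counting, not the realisation of the underlying lazy seq.
(defn frequencies-pmap-eager [words]
  (reduce (partial merge-with +)
          (pmap frequencies
                (map doall (partition-all 10000 words)))))
```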

So, what I'm looking for is either:

a) An explanation for why the pmap-based solution performs so badly

b) A way to fix the "holding onto the head" problem that's inherent in reducers.

With the second of these in mind, it strikes me that the problem fundamentally 
arises from the desire for reducers to follow the same basic API as "normal" 
code. So:

(reduce f (filter ... (map ... coll)))

becomes:

(r/fold f (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce 
and/or fold - would avoid the problem:

(r/fold f (r/filter ... (r/map ...)) coll)
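(To illustrate, here's a toy sketch of such a "collection-last" style; rmap, 
rfilter, and fold* are names I've made up, not part of clojure.core.reducers. 
Each step is a function awaiting a collection, so a pipeline can be composed 
without ever closing over the sequence:)

```clojure
(require '[clojure.core.reducers :as r])

;; Toy sketch of a collection-last reducers API: each transformation
;; is curried into a function of the collection, and the collection
;; is only supplied at the point of the fold.
(defn rmap    [f]    (fn [coll] (r/map f coll)))
(defn rfilter [pred] (fn [coll] (r/filter pred coll)))

(defn fold* [recipe coll]
  (r/fold + (recipe coll)))

;; comp applies right-to-left: map inc first, then keep the evens.
(fold* (comp (rfilter even?) (rmap inc)) (vec (range 10)))
;; => 30
```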

Anyway - I'd be very grateful for any insight into either of the above 
questions. Or for suggestions for an alternative approach that might be more 
fruitful.

Many thanks in advance,

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en