I've been experimenting with reducers using a small example that counts the 
words in Wikipedia pages by parsing the Wikipedia XML dump. The basic structure 
of the code is:

(frequencies (flatten (map get-words (get-pages))))

where get-pages returns a lazy sequence of pages from the XML dump and 
get-words takes a page and returns a sequence of the words on that page. The 
above code takes ~40s to count the words on the first 10000 pages.

If I convert that code to use reducers, it runs in ~22s (yay!).
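
A minimal sketch of what that conversion might look like, assuming r is an
alias for clojure.core.reducers. The get-pages and get-words stubs below are
hypothetical stand-ins for the real XML-parsing functions, and r/mapcat is just
one way to replace map plus flatten; the actual code may differ.

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Hypothetical stand-ins for the real XML-dump functions.
(defn get-pages [] ["the quick brown fox" "the lazy dog" "the end"])
(defn get-words [page] (str/split page #"\s+"))

;; Sequence version: builds intermediate lazy sequences.
(defn count-words-seq []
  (frequencies (flatten (map get-words (get-pages)))))

;; Reducers version: r/mapcat returns a reducible, so frequencies
;; (which is built on reduce) consumes it without intermediate sequences.
(defn count-words-reducers []
  (frequencies (r/mapcat get-words (get-pages))))
```

Both functions should return the same map of word counts; only the way the
intermediate collection is produced changes.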

If I convert it to use fold and therefore run in parallel, it runs in ~13s on 
my 4-core MacBook Pro. So it's faster (yay!) but nowhere near 4x faster (boo).
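
For reference, a small sketch of how r/fold is called. One detail worth keeping
in mind: fold only runs in parallel over foldable collections (currently
vectors and maps) and falls back to a sequential reduce for anything else, such
as a lazy sequence. The data here is made up purely for illustration.

```clojure
(require '[clojure.core.reducers :as r])

;; A vector is foldable, so r/fold can split it and process the
;; pieces in parallel on the fork/join pool.
(def numbers (vec (range 100000)))

;; With a single function, + is used both to reduce each partition and
;; to combine partition results; calling (+) supplies the identity 0.
(defn sum-of-incs [coll]
  (r/fold + (r/map inc coll)))

;; Foldable input: eligible for parallel execution.
(sum-of-incs numbers)

;; Non-vector input: same answer, but reduced sequentially.
(sum-of-incs (range 100000))
```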

The primary reason for this is that, in order to be able to use fold, I've had 
to write my own version of frequencies:

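(require '[clojure.core.reducers :as r])

(defn frequencies-parallel [words]
  (r/fold (partial merge-with +)  ; combine: sum the per-partition counts
          ;; reduce: increment the count for word x (persistent map)
          (fn [counts x] (assoc counts x (inc (get counts x 0))))
          words))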

And, unlike the version in core, this doesn't use transients. If I replace the 
fold with reduce (i.e. make it run sequentially) it runs in ~43s.
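
For comparison, the version in core is built on a transient map, roughly as
sketched below. It is named frequencies-transient here (a hypothetical name) to
avoid shadowing clojure.core/frequencies. A plain reduce threads a single
accumulator through the whole collection, so it can start from (transient {})
and call persistent! once at the end; with fold, each partition is seeded by
calling the combining function with no arguments and its result goes straight
to the combiner, so there is no obvious place for those two calls.

```clojure
;; Sketch of a transient-based frequencies, modelled on the one in core.
(defn frequencies-transient [coll]
  (persistent!                         ; freeze the result exactly once
   (reduce (fn [counts x]
             ;; assoc! updates the transient map in place
             (assoc! counts x (inc (get counts x 0))))
           (transient {})              ; single mutable accumulator
           coll)))
```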

So, I *am* getting a good speedup from parallelising the code (~43s down to 
~13s, about 3.3x on 4 cores), but unfortunately I'm also seeing a roughly 2x 
slowdown (~43s versus ~22s) because I can't use transients.

Can anyone think of any way that it would be possible to modify this code to 
use transients? Or any way to modify reducers to allow transients to be used?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en