On 19 March 2010 17:53, Andrzej <ndrwr...@googlemail.com> wrote:
> I've been toying with various implementations of reduce-like
> functions, trying to do something "smarter" than a simple iteration
> over a collection of data. This hasn't worked out very well, my
> implementation is a lot (~50x) slower than a simple loop. Could anyone
> point out any bottlenecks in this algorithm and advise me on possible
> ways of improving it?

Well, you're calling subvec about twice as many times (I guess) as
there are elements in your input vector. Then you're calling count at
least once for each of the intermediate vectors (and twice for those
which will be split further). Even with O(1) operations like those,
that's certainly going to slow you down significantly if you don't
derive a big concurrency-related benefit from it.
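
For reference, here's a hypothetical sketch of the kind of
implementation I have in mind (I'm only guessing at the shape of your
code): a divide-and-conquer reduce that splits all the way down to
single elements, so an n-element vector pays for roughly 2(n - 1)
subvec calls before any combining happens:

(defn tree-reduce [f v]
  ;; hypothetical reconstruction, assuming a non-empty vector;
  ;; every split costs two subvec calls plus the counts
  (if (= (count v) 1)
    (nth v 0)
    (let [mid (quot (count v) 2)]
      (f (tree-reduce f (subvec v 0 mid))
         (tree-reduce f (subvec v mid))))))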

Ultimately, if what you're trying to do is to perform a reduce-like
operation in a non-sequential fashion so that it can be parallelised
more effectively (taking advantage of the fact that you're dealing
with a commutative operation), perhaps it would be best to split the
input sequence into, say, twice as many chunks as you have execution
contexts (please tweak this figure, of course), run a simple reduce
on each chunk in isolation, and then combine the partial results in a
final step. Even with an approach similar to your first take above,
the case of short vectors (< 10 elements, say, or whatever seems to
yield the best results) could perhaps be handed off to reduce.
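
That cutoff might look something like this (a sketch building on the
hypothetical tree-reduce above; 10 is just the figure mentioned, do
tune it):

(defn tree-reduce-with-cutoff [f v]
  ;; below the (assumed) threshold, a plain sequential reduce is
  ;; cheaper than splitting further
  (if (< (count v) 10)
    (reduce f v)
    (let [mid (quot (count v) 2)]
      (f (tree-reduce-with-cutoff f (subvec v 0 mid))
         (tree-reduce-with-cutoff f (subvec v mid))))))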

An example implementation of the chunked approach:

(defn chunked-commutative-vector-reduce [n f v]
  (let [cnt (count v)
        ;; quot keeps the split points integral even when cnt is not
        ;; divisible by n
        split-points (concat (range 0 cnt (quot cnt n)) [cnt])
        ;; adjacent split points delimit the chunks
        subvs (map #(subvec v %1 %2) split-points (rest split-points))]
    ;; reduce each chunk in parallel, combine the partial results
    (reduce f (pmap #(reduce f %) subvs))))

Some simplistic benchmarking. For simple addition this seems to be
somewhat slower than regular reduce:

user> (dotimes [_ 5] (time (reduce + (vec (range 100000)))))
"Elapsed time: 27.575505 msecs"
"Elapsed time: 25.066665 msecs"
"Elapsed time: 39.06125 msecs"
"Elapsed time: 25.237215 msecs"
"Elapsed time: 24.404847 msecs"

user> (dotimes [_ 5] (time (chunked-commutative-vector-reduce 4 + (vec (range 100000)))))
"Elapsed time: 44.034438 msecs"
"Elapsed time: 45.27503 msecs"
"Elapsed time: 50.472547 msecs"
"Elapsed time: 51.770968 msecs"
"Elapsed time: 51.145611 msecs"

With my convoluted-op the chunked version is somewhat faster (though
some particularly slow runs do happen occasionally for whatever
reason):

;; an artificially work-intensive (and commutative) op for benchmarking
(defn convoluted-op [x y]
  (inc (Long. (Math/round (Math/pow (rem (* x x y y) (+ x y 1)) 8)))))
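
(A quick sanity check that it really is commutative -- both
(* x x y y) and (+ x y 1) are symmetric in x and y:)

user> (= (convoluted-op 3 4) (convoluted-op 4 3))
true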

user> (chunked-commutative-vector-reduce 4 convoluted-op (vec (range 100000)))
9223372036854775808
user> (reduce convoluted-op (vec (range 100000)))
9223372036854775808
user> (dotimes [_ 5] (time (reduce convoluted-op (vec (range 100000)))))
"Elapsed time: 459.935904 msecs"
"Elapsed time: 489.130463 msecs"
"Elapsed time: 489.130463 msecs"
"Elapsed time: 479.653072 msecs"
"Elapsed time: 436.05228 msecs"
"Elapsed time: 431.632579 msecs"
nil
user> (dotimes [_ 5] (time (chunked-commutative-vector-reduce 4 convoluted-op (vec (range 100000)))))
"Elapsed time: 332.564412 msecs"
"Elapsed time: 279.848551 msecs"
"Elapsed time: 312.854074 msecs"
"Elapsed time: 301.574794 msecs"
"Elapsed time: 276.898379 msecs"
nil

With + changed to convoluted-op, your sum_tree function takes just
under 3.5 s to complete the same calculation on my box.

Also, the chunked reduce function basically degenerates into the
standard reduce (with minor complications) when called with n = 1,
so the single-threaded runtime is pretty much the same.
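
(To see why: with n = 1 the split points collapse to the two
endpoints, so pmap gets a single chunk covering the whole vector:)

user> (let [cnt 100000] (concat (range 0 cnt (quot cnt 1)) [cnt]))
(0 100000)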

That's just my initial take... Probably far from the best that one
could do. Ideally the number-of-chunks argument would not be necessary
at all -- the JVM can report the number of available processors
through Runtime.getRuntime().availableProcessors(), and perhaps some
day we'll be able to use Grand Central Dispatch or something for
things like this. (But note that benchmarking with more work-intensive
functions inside the reduce is important in any case.)
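
Something like this hypothetical wrapper would do for the default
(the factor of two is the heuristic from above):

(defn auto-chunked-reduce [f v]
  ;; default to twice as many chunks as the JVM reports processors
  (let [n (* 2 (.availableProcessors (Runtime/getRuntime)))]
    (chunked-commutative-vector-reduce n f v)))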

Oh, one more thing. The presentation by Guy Steele at ICFP 2009,
"Organizing Functional Code for Parallel Execution; or, foldl and
foldr Considered Slightly Harmful", is interesting in this context;
links to the presentation video as well as the slide deck are
available here:

http://lambda-the-ultimate.org/node/3616

By the way, this gave me an idea which I might be able to put to
good use, so -- thanks! :-)

Sincerely,
Michał
