Consider the following:

You have 10 million data points and you need to apply a multipass algorithm to 
them. Each pass is like a cellular
automata: it can read from the previous pass but it doesn't know the "current" 
values. This makes the actual
processing of each value trivially parallelizable. The actual operation for 
each value is fairly simple and cheap
(essentially a multidimensional ancestor-child averaging operation).

After each value has been operated on once, the pass is complete and the 
"current" and "old" buffers are switched
(conceptually, the "current" buffer can only be written to, the "old" buffer 
can only be read--using __gshared
here).

The number of passes is not fixed; in the course of each value operation, an 
error is computed. When the worst
individual error falls below a certain threshold, the algorithm is finished. 
Generally this will take between one
thousand and ten thousand passes.

How would you go about parallelizing this? My thought is to take the map/reduce 
approach within each pass: each
thread/fiber takes a slice of the dataset, makes its modifications, then 
returns an error summary. These summaries
are quickly combined and the algorithm loop decides whether to run again. Each 
pass shouldn't take more than a
second or two, so I'm not sure whether introducing the overhead of spawning, 
say, 10 threads each pass is
worthwhile (times 5000 passes). On the other hand, I have plenty of CPUs to 
throw at it (at least 16 cores, each
with hyperthreading) and am in a situation where "as fast as possible" is 
important (while individual datasets may
not grow, the number of them is).

Any thoughts appreciated.

Justin

Reply via email to