Hi Jake,
> Are you saying you want something more sophisticated than just setting
> your number of reducers equal to zero and then repeatedly running your
> Map(minus Reduce) job on Hadoop? The Mappers will go where the
> data is, as you say, but if your mapper output then needs to be collected
> in some place and aggregated, you'll need the shuffle + Reduce step.
Yes, I'd like to be more sophisticated than that. Assume the output of
each mapper is 512MB of doubles. Then writing these to HDFS, shuffling
and reducing them, and re-reading them in the next pass easily
dominates the overall runtime of the algorithm, and that's what I'd
like to avoid. Current ("local") solutions are usually limited by
network bandwidth, and Hadoop offers some relief on that front.
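
For concreteness, the zero-reducer baseline I'm comparing against would
look roughly like this (just a sketch against the
org.apache.hadoop.mapreduce API; the mapper logic, key/value types and
paths are placeholders, not my actual job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapOnlyPass {

    // Placeholder mapper: parses one double per input line.
    public static class PassMapper
            extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, new DoubleWritable(Double.parseDouble(value.toString())));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-only pass");
        job.setJarByClass(MapOnlyPass.class);
        job.setMapperClass(PassMapper.class);

        // Zero reducers: mapper output goes straight to HDFS, no
        // shuffle/sort. Each pass still pays for writing and then
        // re-reading the ~512MB of output per mapper, which is the
        // cost I'd like to avoid between passes.
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}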
In a way, I want a sequential program scheduled through Hadoop. I will
lose the parallelism, but I want to keep the data locality, scheduling,
and restart-on-failure.
Thanks,
Markus