On Thu, Jan 28, 2010 at 11:30 AM, Markus Weimer <[email protected]> wrote:
> Hi Jake,
>
> > Are you saying you want something more sophisticated than just setting
> > your number of reducers equal to zero and then repeatedly running your
> > Map(minus Reduce) job on Hadoop? The Mappers will go where the
> > data is, as you say, but if your mapper output then needs to be collected
> > in some place and aggregated, you'll need the shuffle + Reduce step.
>
> Yes, I'd like to be more sophisticated than that. Assume that the
> output of each mapper is 512MB of doubles. Then, writing these to hdfs
> and shuffling, reducing & re-reading them in the next pass easily
> dominates the overall runtime of the algorithm.
So, trying to think more specifically: is this 0.5GB set of doubles that
each mapper outputs needed by any other mapper in subsequent iterations of
your algorithm? If it's only needed by the mapper working on the data local
to the current node, you can do as Ted says and write it to local disk,
reading it back in on the next pass.
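
Something like the following is roughly what I have in mind - a map-only
pass (zero reducers) where each mapper appends its doubles to a file on
local disk instead of pushing them through the shuffle. This is a completely
untested sketch; LOCAL_CACHE and the per-record math are made up, and you'd
really want to derive the scratch path from mapred.local.dir or similar:

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LocalCachePass {

  public static class CachingMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    // Hypothetical node-local scratch file; in practice derive this from
    // the task's local directories rather than hard-coding it.
    private static final String LOCAL_CACHE = "/tmp/my-app-cache.bin";

    private DataOutputStream out;

    @Override
    protected void setup(Context context) throws IOException {
      // On iterations > 0 you would first read back the doubles this node
      // wrote on the previous pass, instead of shuffling them via HDFS.
      out = new DataOutputStream(
          new BufferedOutputStream(new FileOutputStream(LOCAL_CACHE)));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException {
      // Compute this record's contribution and append it to local disk
      // rather than emitting it toward a (non-existent) reduce phase.
      double contribution = value.toString().length(); // placeholder math
      out.writeDouble(contribution);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      out.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only pass");
    job.setJarByClass(LocalCachePass.class);
    job.setMapperClass(CachingMapper.class);
    job.setNumReduceTasks(0);                    // no shuffle, no reduce
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    // FileInputFormat.addInputPath(job, new Path(args[0])); // input on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat I'm honestly not sure about: data-local scheduling is best-effort
and each block has several replicas, so the next iteration isn't guaranteed
to land on the same node that wrote the cache.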
> And that's what I'd
> like to avoid. Current ("local") solutions are usually limited by the
> network bandwidth, and hadoop offers some relief on that.
>
How does network bandwidth come into play in a "local" solution?
> In a way, I want a sequential program scheduled through hadoop. I will
> lose the parallelism, but I want to keep data locality, scheduling
> and restart-on-failure.
You're still doing things partially parallelized, right? Because your input
data set is large enough to need to be split between machines, and your
algorithm can work on each chunk independently? Or is this not the case?
-jake