especially at scale!
And we are testing on >1000 node clusters with long jobs. We see
lots of failures per job.
On Aug 24, 2007, at 4:20 PM, Ted Dunning wrote:
On 8/24/07 12:11 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
>
>> Using the same logic, streaming reduce outputs to
>> the next map and reduce steps (before the first reduce is complete)
>> should also provide speedup.
>
> Perhaps, but the bookkeeping required in the jobtracker might be onerous.
> The failure modes are more complex, complicating recovery.
Frankly, I find Doug's arguments about reliability fairly compelling.
Map-reduce^n is not the same, nor is it entirely analogous to pipe-style
programming. It feels the same, but there are very important differences
that I wasn't thinking of when I made this suggestion. The most important
is the issue of reliability. In a large cluster, failure is a continuous
process, not an isolated event. As such, the problems of having to roll
back an entire program due to node failure are not something that can be
treated as unusual. That makes Doug's comments about risk more on-point
than the potential gains. It is very easy to imagine scenarios where the
possibility of program roll-back results in very large average run-times
while the chained reduce results in only incremental savings. This isn't
a good thing to bet on.
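The rollback argument can be made concrete with a back-of-the-envelope model
(not from the thread; the rates and stage counts below are invented for
illustration). If failures arrive at a Poisson rate and a failure forces a
restart from scratch, the expected completion time of a task of length T is
the standard (e^(lambda*T) - 1)/lambda. Pipelining reduce output into the
next map makes the whole chain one such task, while materializing each
stage's output makes only the failed stage restart:

```python
import math

def expected_runtime(T, lam):
    """Expected completion time of a task of length T (hours) that must
    restart from scratch on failure, with failures arriving at Poisson
    rate lam (per hour). Standard result: (e^(lam*T) - 1) / lam."""
    return (math.exp(lam * T) - 1.0) / lam

# Hypothetical numbers: 10 hours of total work, and one job-affecting
# failure every 2 hours on average (lam = 0.5/hour) -- plausible on a
# >1000-node cluster where "failure is a continuous process".
lam = 0.5

# Pipelined chain: any failure rolls back the entire 10-hour program.
pipelined = expected_runtime(10.0, lam)

# Checkpointed chain: 5 stages of 2 hours each; a failure only
# re-runs the stage it hit, because earlier outputs are materialized.
staged = 5 * expected_runtime(2.0, lam)

print(f"pipelined: {pipelined:.1f} h, staged: {staged:.1f} h")
```

Under these assumed numbers the pipelined chain's expected runtime is more
than an order of magnitude worse than the staged one, which is exactly the
"very large average run-times" versus "incremental savings" trade-off above.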