especially at scale!

And we are testing on >1000-node clusters with long jobs. We see lots of failures per job.

On Aug 24, 2007, at 4:20 PM, Ted Dunning wrote:
On 8/24/07 12:11 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

>
>> Using the same logic, streaming reduce outputs to
>> the next map and reduce steps (before the first reduce is complete)
>> should also provide speedup.
>
> Perhaps, but the bookkeeping required in the jobtracker might be onerous.
> The failure modes are more complex, complicating recovery.

Frankly, I find Doug's arguments about reliability fairly compelling.

Map-reduce^n is not the same as, nor entirely analogous to, pipe-style programming. It feels the same, but there are very important differences that I wasn't thinking of when I made this suggestion. The most important is reliability. In a large cluster, failure is a continuous process, not an isolated event, so having to roll back an entire program due to node failure is not something that can be treated as unusual. That makes Doug's comments about risk more on-point than the potential gains.

It is very easy to imagine scenarios where the possibility of program roll-back results in very large average run-times, while the chained reduce yields only incremental savings. That isn't a good thing to bet on.
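A back-of-the-envelope model (my own sketch, not from the thread) makes the roll-back cost concrete. Assume each time unit fails independently with probability p and a failed attempt costs its full duration — a deliberately crude model. With per-stage checkpointing (chained map-reduce jobs), each stage is retried on its own; with a fully pipelined run, the whole program must succeed end to end or restart from scratch:

```python
def expected_time(stage_len, stages, p, checkpointed):
    """Expected run time of a multi-stage pipeline under random failures.

    Crude model: each time unit fails independently with probability p,
    and a failed attempt costs its full duration before the retry.
    """
    q = (1 - p) ** stage_len  # prob. one stage finishes without a failure
    if checkpointed:
        # chained jobs: each stage retried independently,
        # geometric number of attempts per stage
        return stages * stage_len / q
    # pipelined: the entire run must succeed atomically
    return stages * stage_len / (q ** stages)

p = 0.001        # failure probability per time unit (illustrative)
stage_len = 600  # one map-reduce stage lasts 600 time units

for stages in (1, 3, 5):
    chained = expected_time(stage_len, stages, p, checkpointed=True)
    piped = expected_time(stage_len, stages, p, checkpointed=False)
    print(stages, round(chained), round(piped))
```

With these (made-up) numbers, the checkpointed pipeline's cost grows roughly linearly in the number of stages, while the all-or-nothing run blows up geometrically — which is the "very large average run-times" risk in a nutshell. At one stage the two are identical, so the chained design costs nothing in the base case.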
