Stating the obvious: the issue is starting the next mapper using outputs from a reduce that is not yet complete. Currently, a reduce succeeds all-or-nothing, and partial output is not available. If this behavior is changed, one has to make sure that:
- partial reduce outputs are cleaned up in case of eventual failure of the reduce job (and, I guess, the next map job is killed as well)
- the next map job runs in a mode where its input files arrive over time (and are not statically fixed at launch time)

Neither of these seems like a small change. Even without such optimizations, it would be great if one could submit a dependency graph of jobs to the JobTracker (instead of arranging it from outside). (But I digress.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Dyer
Sent: Tuesday, November 06, 2007 5:05 PM
To: [email protected]
Subject: Re: performance of multiple map-reduce operations

On 11/6/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Joydeep Sen Sarma wrote:
> > One of the controversies is whether, in the presence of failures, this
> > makes performance worse rather than better (kind of like UDP vs. TCP;
> > what's better depends on the error rate). The probability of a failure
> > per job will increase non-linearly as the number of nodes involved per
> > job increases. So what might make sense for small clusters may not make
> > sense for bigger ones. But it sure would be nice to have this option.
>
> Hmm. Personally I wouldn't put a very high priority on complicated
> features that don't scale well.

I don't think that what we're asking for is necessarily complicated, and I
think it can be made to scale quite well. The critique is appropriate for
the current "polyreduce" prototype, but that is a particular solution to a
general problem. And *scaling* is exactly the issue that we are trying to
address here (since the current model is quite wasteful of computational
resources).
As for complexity, I agree that the current prototype that Joydeep linked
to is needlessly cumbersome. But in principle, the only additional
information that would need to be provided to a scheduler to plan a
composite job is a dependency graph between the mappers and reducers.
Dependency graphs are about as simple as it gets in this business, and they
certainly aren't going to cause any scaling or conceptual problems. I don't
see how failures would be any more catastrophic if this were implemented
purely in terms of smarter scheduling (which is emphatically not what the
current polyreduce prototype does). You're running the exact same jobs you
would be running otherwise, just sooner. :)

Chris
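[Editor's note: to make the dependency-graph idea discussed above concrete, here is a minimal, Hadoop-independent sketch. Jobs are nodes, edges are dependencies, and a topological sort (Kahn's algorithm) yields a valid launch order; a real scheduler could additionally start a job the moment its prerequisites finish. The `JobGraph` class and its method names are hypothetical illustrations, not an actual Hadoop API.]

```java
import java.util.*;

// Sketch of a job dependency graph (hypothetical API, not Hadoop's).
// Kahn's algorithm produces a launch order in which every job runs only
// after all of its prerequisites have completed.
public class JobGraph {
    private final Map<String, List<String>> dependents = new HashMap<>();
    private final Map<String, Integer> indegree = new HashMap<>();

    public void addJob(String name) {
        dependents.putIfAbsent(name, new ArrayList<>());
        indegree.putIfAbsent(name, 0);
    }

    // "downstream" cannot start until "upstream" has finished.
    public void addDependency(String upstream, String downstream) {
        addJob(upstream);
        addJob(downstream);
        dependents.get(upstream).add(downstream);
        indegree.merge(downstream, 1, Integer::sum);
    }

    // Repeatedly "launch" jobs that have no unfinished prerequisites.
    public List<String> launchOrder() {
        Deque<String> ready = new ArrayDeque<>();
        Map<String, Integer> remaining = new HashMap<>(indegree);
        for (Map.Entry<String, Integer> e : remaining.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job);
            for (String next : dependents.get(job))
                if (remaining.merge(next, -1, Integer::sum) == 0)
                    ready.add(next);
        }
        if (order.size() != indegree.size())
            throw new IllegalStateException("cycle in job graph");
        return order;
    }

    public static void main(String[] args) {
        JobGraph g = new JobGraph();
        // The chained-job scenario from the thread: map1 -> reduce1 -> map2 -> reduce2.
        g.addDependency("map1", "reduce1");
        g.addDependency("reduce1", "map2");
        g.addDependency("map2", "reduce2");
        System.out.println(g.launchOrder()); // prints [map1, reduce1, map2, reduce2]
    }
}
```

Note that this only captures the "submit a dependency graph instead of arranging it from outside" part; overlapping a downstream map with a still-running upstream reduce would additionally require the partial-output and cleanup machinery listed at the top of the thread.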

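[Editor's note: the quoted claim that per-job failure probability grows non-linearly with the number of nodes can be made precise under a simple independence assumption: if each of n nodes independently fails with probability p during a job, then P(at least one failure) = 1 - (1 - p)^n. The p value below is illustrative, not a measured rate.]

```java
// Illustrates why larger composite jobs fail more often: assuming each of
// n nodes independently fails with probability p during the job's lifetime,
// the chance that at least one node fails is 1 - (1 - p)^n, which grows
// non-linearly with n (roughly linearly for small n*p, then saturating).
public class FailureOdds {
    static double jobFailureProb(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.001; // illustrative per-node failure probability
        for (int n : new int[] {10, 100, 1000, 10000}) {
            System.out.printf("n=%5d  P(job sees a failure) = %.4f%n",
                              n, jobFailureProb(p, n));
        }
    }
}
```

With p = 0.001, a 10-node job almost never sees a failure, while a 1000-node job sees one more often than not, which is the trade-off Joydeep describes between small and large clusters.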