I agree with Owen and Doug. As long as the intermediate outputs (i.e. the
data between phases) are stored on the tasktrackers' local disks, which
are prone to failure, having more than two phases would be
counterproductive. If intermediate data were stored on a fault-tolerant
DFS, chaining an arbitrary sequence of phases would pay off more. (But
then the reasoning in the original email for having multiple phases, i.e.
not having to upload data to DFS, would no longer hold.)

- milind


On 8/24/07 9:53 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
>> It isn't hard to implement these programs as multiple fully fledged
>> map-reduces, but it appears to me that many of them would be better
>> expressed as something more like a map-reduce-reduce program.
>> 
>> [ ... ]
>> 
>> Expressed conventionally, this would have to write all of the user
>> sessions to HDFS, and a second map phase would generate the pairs for
>> counting.  The opportunity for efficiency would come from the ability
>> to avoid writing intermediate results to the distributed data store.
>>     
>> Has anybody looked at whether this would help and whether it would be hard
>> to do?
> 
> It would make the job tracker more complicated, and might not help job
> execution time that much.
> 
> Consider implementing this as multiple map-reduce steps, but using a
> replication level of one for intermediate data.  That would mostly have
> the performance characteristics you want.  But if a node died, the
> framework could not intelligently re-create just the missing data
> automatically.  Instead the application would have to re-run the entire
> job, or subsets of it, in order to re-create the un-replicated data.
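> 
> For instance, here's a minimal sketch against the old
> org.apache.hadoop.mapred API; the class name, the paths, and the
> identity map/reduce stand-ins are hypothetical placeholders for a real
> application's classes:
> 
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapred.*;
>     import org.apache.hadoop.mapred.lib.*;
> 
>     public class ChainedJobs {
>       public static void main(String[] args) throws Exception {
>         Path intermediate = new Path("/tmp/intermediate");  // hypothetical
> 
>         // First job: write its output un-replicated.  "dfs.replication"
>         // is read by the HDFS client when the reducers create their
>         // output files.
>         JobConf first = new JobConf(ChainedJobs.class);
>         first.setJobName("phase-1");
>         first.setMapperClass(IdentityMapper.class);    // stand-in
>         first.setReducerClass(IdentityReducer.class);  // stand-in
>         first.setInt("dfs.replication", 1);            // single replica
>         FileInputFormat.setInputPaths(first, new Path(args[0]));
>         FileOutputFormat.setOutputPath(first, intermediate);
>         JobClient.runJob(first);                       // blocks until done
> 
>         // Second job: consume the un-replicated intermediate data.  If
>         // a datanode holding it has died, the whole first job must be
>         // re-run; the framework cannot recover just the lost blocks.
>         JobConf second = new JobConf(ChainedJobs.class);
>         second.setJobName("phase-2");
>         second.setMapperClass(IdentityMapper.class);
>         second.setReducerClass(IdentityReducer.class);
>         FileInputFormat.setInputPaths(second, intermediate);
>         FileOutputFormat.setOutputPath(second, new Path(args[1]));
>         JobClient.runJob(second);
>       }
>     }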
> 
> Under poly-reduce, if a node failed, all tasks that were incomplete on
> that node would need to be restarted.  But first, their input data would
> need to be located.  If you saved all intermediate data in the course of
> a job (which would be expensive) then the inputs that need re-creation
> would mostly just be those that were created on the failed node.  But
> re-creating those inputs requires their own inputs in turn, so the
> failure would generally cascade all the way back to the initial map
> stage.  So a single machine failure in the last phase could double the
> run time of the job, with most of the cluster idle.
> 
> If, instead, you used normal mapreduce, with intermediate data
> replicated in the filesystem, a single machine failure in the last phase
> would only require re-running tasks from the last job.
> 
> Perhaps, when chaining mapreduces, one should use a lower replication
> level for intermediate data, like two.  Additionally, one might wish to
> relax the one-replica-off-rack criterion for such files, so that
> replication is faster, and since whole-rack failures are rare.  This
> might give good chained performance, but keep machine failures from
> knocking tasks back to the start of the chain.  Currently it's not
> possible to disable the one-replica-off-rack preference, but that might
> be a reasonable feature request.
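> 
> Concretely, in the hypothetical sketch above, that tuning would just
> change the first job's replication line:
> 
>     // Two replicas instead of the default three: faster to write
>     // than fully replicated data, but a single machine failure no
>     // longer knocks the chain back to the start.
>     first.setInt("dfs.replication", 2);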
> 
> Doug
> 

--
Milind Bhandarkar
408-349-2136
([EMAIL PROTECTED])
