Should be able to re-run jobs, collecting only missing output
-------------------------------------------------------------
Key: HADOOP-223
URL: http://issues.apache.org/jira/browse/HADOOP-223
Project: Hadoop
Type: New Feature
Components: mapred
Reporter: Bryan Pendleton
Priority: Minor
For jobs with no side effects (roughly equivalent to jobs with speculative
execution enabled), it should be possible to re-run a job that generated only
partial output and fill in the missing pieces. I have now run the same job
twice, once finishing 42 of 44 reduce tasks and once finishing only 17. Each
time, many nodes failed, causing a large number of task failures (in one case,
5k failures from 15k map tasks and 23 failures from 44 reduces), but some valid
output was generated. Since the output depends only on the input, and both jobs
used the same input, I can combine the outputs of these two failed jobs to get
a completed job's output. This should be something that can be more automatic.
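As a rough illustration of the manual combining step described above, the
sketch below picks, for each reduce partition, a run that completed it, and
reports any partition that no run completed. It is not Hadoop code: the function
name plan_merge, the run labels, and the specific partition numbers are all
hypothetical, chosen only to match the 42-of-44 and 17-of-44 counts mentioned in
the report.

```python
def plan_merge(num_partitions, completed_by_run):
    """Choose a source run for every partition that some run completed.

    completed_by_run maps a run label to the set of partition indices
    whose reduce task finished successfully in that run (hypothetical
    bookkeeping; Hadoop does not expose this structure as-is).
    Returns (plan, missing): plan maps partition -> run label, and
    missing lists partitions that no run completed.
    """
    plan = {}
    for partition in range(num_partitions):
        for run_label, completed in completed_by_run.items():
            if partition in completed:
                plan[partition] = run_label  # first run that has it wins
                break
    missing = [p for p in range(num_partitions) if p not in plan]
    return plan, missing


# Illustrative numbers only: which partitions failed is invented here.
NUM_REDUCES = 44
run_a = set(range(NUM_REDUCES)) - {7, 30}   # 42 of 44 reduces finished
run_b = set(range(16)) | {30}               # 17 of 44 reduces finished

plan, missing = plan_merge(NUM_REDUCES, {"run_a": run_a, "run_b": run_b})
```

Because the output depends only on the input, it does not matter which run a
given partition is taken from; the sketch simply prefers the first run listed.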
In particular, it should be possible to resubmit a job with a list of
partitions that should be ignored. A special Combiner, or pre-Combiner, would
throw out any map output for partitions that have already been successfully
completed, thus reducing the amount of data that needs to be reduced to
complete the job. It would, of course, be nice to support "filling in" existing
outputs, rather than having to do a move operation on completed outputs.
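The filtering idea can be sketched outside of Hadoop as follows. This is only a
minimal model of the proposal, not an existing Hadoop API: partition_for and
drop_completed are hypothetical names, and the CRC32-based partitioner is a
stand-in for whatever partitioner the real job uses. The filter is only correct
if the re-run uses the same partitioner and the same number of partitions as
the original run.

```python
import zlib


def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (an assumption).

    CRC32 is used instead of Python's built-in hash() because hash() of a
    str can differ between processes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def drop_completed(map_output, num_partitions, completed):
    """Yield only the (key, value) pairs bound for unfinished partitions.

    completed is the set of partition indices whose output already exists
    from an earlier run, i.e. the "partitions that should be ignored".
    """
    for key, value in map_output:
        if partition_for(key, num_partitions) not in completed:
            yield key, value


# Usage: pretend the partition that "apple" maps to was already completed.
records = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]
done = {partition_for("apple", 4)}
remaining = list(drop_completed(records, 4, done))
```

Dropping these records before the shuffle is what would reduce the amount of
data that has to be reduced on the re-run.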