[jira] [Commented] (MAPREDUCE-460) Should be able to re-run jobs, collecting only missing output

Allen Wittenauer (JIRA) Thu, 17 Jul 2014 21:57:28 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066019#comment-14066019 ]


Allen Wittenauer commented on MAPREDUCE-460:
--------------------------------------------

Is there another JIRA that covers this and simply isn't linked to this one?
Is there any reason to keep this and MAPREDUCE-443 open?

> Should be able to re-run jobs, collecting only missing output
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-460
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>            Reporter: Bryan Pendleton
>
> For jobs with no side effects (roughly, jobs with speculative execution
> enabled), if partial output has been generated, it should be possible to
> re-run the job and fill in the missing pieces. I have now run the same job
> twice, once finishing 42 of 44 reduce tasks, another time finishing only 17.
> Each time, many nodes failed, causing many tasks to fail (in one
> case, 5k failures from 15k map tasks and 23 failures from 44 reduces), but some
> valid output was generated. Since the output depends only on the input,
> and both jobs used the same input, I can now combine the outputs of these two
> failed jobs to get a completed job's output. This should be something
> that can be more automatic.
> In particular, it should be possible to resubmit a job, with a list of 
> partitions that should be ignored. A special Combiner, or pre-Combiner, would 
> throw out any map output for partitions that have already been successfully 
> completed, thus reducing the amount of data that needs to be reduced to 
> complete the job. It would, of course, be nice to support "filling in" 
> existing outputs, rather than having to do a move operation on completed 
> outputs.
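
To make the proposal in the description concrete, here is a minimal sketch of the "ignore completed partitions" idea. It is a plain Python simulation and does not use any Hadoop API. The names `partition_for`, `missing_partitions`, and `drop_completed` are hypothetical, the CRC32-based partitioner is only a stand-in for whatever partitioner the real job uses, and the partition numbers in the usage example are invented for illustration.

```python
import zlib
from typing import Iterable, Iterator, Set, Tuple


def partition_for(key: str, num_reduces: int) -> int:
    """Map a key to a reduce partition.

    Assumption: a stable CRC32 hash stands in for the job's real
    partitioner. What matters is that the same key always lands in the
    same partition on every run, so outputs from different runs line up.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def missing_partitions(num_reduces: int, completed: Set[int]) -> Set[int]:
    """Return the partitions that still have no valid output."""
    return set(range(num_reduces)) - completed


def drop_completed(
    records: Iterable[Tuple[str, int]],
    num_reduces: int,
    completed: Set[int],
) -> Iterator[Tuple[str, int]]:
    """Hypothetical 'pre-Combiner': discard map output whose partition
    was already completed by an earlier run, so only the missing
    partitions are shuffled and reduced."""
    for key, value in records:
        if partition_for(key, num_reduces) not in completed:
            yield key, value


if __name__ == "__main__":
    num_reduces = 44
    # Hypothetical: two earlier runs each completed a different subset.
    run_a = set(range(44)) - {5, 30}
    run_b = set(range(17))
    completed = run_a | run_b  # union of everything already finished

    todo = missing_partitions(num_reduces, completed)
    map_output = [(f"key-{i}", i) for i in range(1000)]
    kept = list(drop_completed(map_output, num_reduces, completed))
    print(f"partitions still to reduce: {sorted(todo)}")
    print(f"map records kept: {len(kept)} of {len(map_output)}")
```

This only works under the condition stated in the description: the output must depend on the input alone, and the number of partitions and the partitioning function must be identical across runs.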



