[ 
http://issues.apache.org/jira/browse/HADOOP-223?page=comments#action_12432196 ] 
            
Bryan Pendleton commented on HADOOP-223:
----------------------------------------

This issue still plagues me. I have a large enough dataset, that processing 
through it and storing temporary outputs often causes many failures. I still 
have my local fix for re-running (very cumbersome). I'd love to see a general 
solution supported.

What I currently do is to inject a special combiner, that only produces output 
for partitions that didn't complete in a previous run. Thus, if 20% of my job 
finishes (completes writing reduce output to DFS), I need 20% less working 
space to run the subsequent job attempt. Usually, this is enough progress.

However, that suffers from a lot of limits:
1) There's no auto-magic storage of what partitions were generated.
2) This loses the ability to have a Combiner doing something else
3) It also requires a new output directory, and hand-reassembly of the 
completed tasks.

It seems like it should be solvable more generally by:
1) Storing the partitions that are generated when a job is initially run.
2) Keeping track of what partitions have completed.
3) Allowing a job to be re-submitted, using the previous partitions (and not 
failing on the output existance check)

Anyone else run into these issues? I'd be happy to write up how I actually work 
around the problem now, but it seems like a more general solution would be 
helpful to a wider audience.

> Should be able to re-run jobs, collecting only missing output
> -------------------------------------------------------------
>
>                 Key: HADOOP-223
>                 URL: http://issues.apache.org/jira/browse/HADOOP-223
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Bryan Pendleton
>            Priority: Minor
>
> For jobs with no side effects (roughly == jobs with speculative execution 
> enabled), if partial output has been generated, it should be possible to 
> re-run the job, and fill in the missing pieces. I have now run the same job 
> twice, once finishing 42 of 44 reduce tasks, another time finishing only 17. 
> Each time, many nodes have failed, causing many many tasks to fail ( in one 
> case, 5k failures from 15k map tasks, 23 failures from 44 reduces), but some 
> valid output was generated. Since the output is only dependent on the input, 
> and both jobs used the same input, I will now be able to combine these two 
> failed task outputs to get a completed job's output. This should be something 
> that can be more automatic.
> In particular, it should be possible to resubmit a job, with a list of 
> partitions that should be ignored. A special Combiner, or pre-Combiner, would 
> throw out any map output for partitions that have already been successfully 
> completed, thus reducing the amount of data that needs to be reduced to 
> complete the job. It would, of course, be nice to support "filling in" 
> existing outputs, rather than having to do a move operation on completed 
> outputs.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to