[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3846:
-----------------------------------------------

    Attachment: MAPREDUCE-3846-20120210.txt

Logging all TaskAttempts, even those not yet launched, might avoid this, but 
I am not sure. So for now, I changed attempt-number generation during 
recovery to first reuse the numbers from the previous generation, and to jump 
to a new range only after all of those numbers are exhausted.
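
A minimal sketch of that numbering scheme, for illustration only. The class 
name, the constructor arguments and the freshStart offset are all made up 
here and are not taken from the patch; the sketch only assumes that the 
attempt numbers seen in the previous generation are known at recovery time.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.TreeSet;

/** Hypothetical allocator; not the class used in the actual patch. */
final class RecoveredAttemptNumbers {
    // Numbers already used by the previous AM generation, in ascending order.
    private final Deque<Integer> recovered = new ArrayDeque<>();
    // First number of the new range; the value is an assumption supplied by the caller.
    private int nextFresh;

    RecoveredAttemptNumbers(Collection<Integer> previousNumbers, int freshStart) {
        // TreeSet sorts the recovered numbers and drops duplicates.
        recovered.addAll(new TreeSet<>(previousNumbers));
        this.nextFresh = freshStart;
    }

    /** Reuses recovered numbers first, then jumps to the fresh range. */
    int next() {
        Integer reused = recovered.poll();
        return reused != null ? reused : nextFresh++;
    }
}
```

With previous numbers {0, 1} and a fresh start of 1000, successive calls to 
next() hand out 0, 1, 1000, 1001.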

I also made sure that attempts are replayed in the order of their original 
start times; otherwise, as my test revealed, we may replay them in the wrong 
order and with the wrong times.
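
The ordering step can be sketched roughly as below. AttemptRecord and 
inReplayOrder are hypothetical names used only for this example, and the 
tie-break on attempt number is an assumption, not something the patch is 
known to do.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; not the recovery code from the actual patch. */
final class ReplayOrder {
    /** Minimal stand-in for a recovered attempt: its number and original start time. */
    static final class AttemptRecord {
        final int attemptNumber;
        final long startTime;

        AttemptRecord(int attemptNumber, long startTime) {
            this.attemptNumber = attemptNumber;
            this.startTime = startTime;
        }
    }

    /** Returns a copy of the recovered attempts sorted by original start time. */
    static List<AttemptRecord> inReplayOrder(Collection<AttemptRecord> recovered) {
        List<AttemptRecord> sorted = new ArrayList<>(recovered);
        // Earliest start first; equal start times fall back to attempt number (assumed tie-break).
        sorted.sort(Comparator
                .comparingLong((AttemptRecord r) -> r.startTime)
                .thenComparingInt(r -> r.attemptNumber));
        return sorted;
    }
}
```

Replaying from the sorted copy keeps the result independent of whatever 
order the attempts were read back in.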


The test fails without the patch and passes with it.

Sharad, can you please look at the patch and see if it makes sense? Thanks in 
advance!
                
> Restarted+Recovered AM hangs in some corner cases
> -------------------------------------------------
>
>                 Key: MAPREDUCE-3846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>         Attachments: MAPREDUCE-3846-20120210.txt
>
>
> [~karams] found this while testing the AM restart/recovery feature. After 
> the first-generation AM crashes (manually killed by kill -9), the 
> second-generation AM starts, but hangs after a while.
