[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185245#comment-13185245
 ] 

Siddharth Seth commented on MAPREDUCE-3656:
-------------------------------------------

Updating the patch to address the review comments.
bq. For handling the case where JVM is unregistered before it gets a task, we 
should remove it from launchedJVMs during unregister. Once we do this, we 
should think about synchronization issues carefully.
Good catch. I am uploading a patch that removes the JVM from the launchedJVMs 
set in the unregister call, before removing it from jvmIDToActiveAttemptMap. 
The ordering of events between unregister and getTask should take care of the 
synchronization issues.
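
For illustration, a minimal sketch of the unregister ordering described above. 
This is not the patch itself: the class name, the String-typed IDs, the 
isLaunched helper and the simplified getTask logic are assumptions made for 
the example; only the names launchedJVMs and jvmIDToActiveAttemptMap come 
from the comment.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the listener that tracks JVMs and task attempts.
class JvmRegistrySketch {
    // JVMs whose containers have been launched but which have not yet pulled a task.
    private final Set<String> launchedJVMs = ConcurrentHashMap.newKeySet();
    // JVM id -> attempt waiting to be handed to that JVM.
    private final Map<String, String> jvmIDToActiveAttemptMap = new ConcurrentHashMap<>();

    void registerPendingTask(String jvmId, String attemptId) {
        jvmIDToActiveAttemptMap.put(jvmId, attemptId);
    }

    void registerLaunchedTask(String jvmId) {
        launchedJVMs.add(jvmId);
    }

    // Remove from launchedJVMs first, then from the attempt map, so a JVM that
    // is unregistered before it ever asks for a task leaves no stale entry.
    void unregister(String jvmId) {
        launchedJVMs.remove(jvmId);
        jvmIDToActiveAttemptMap.remove(jvmId);
    }

    // Returns the attempt for this JVM, or null if the JVM is unknown,
    // not yet launched, or has already been given its task.
    String getTask(String jvmId) {
        if (!jvmIDToActiveAttemptMap.containsKey(jvmId)) {
            return null; // unknown or already unregistered JVM
        }
        if (!launchedJVMs.contains(jvmId)) {
            return null; // launch not yet recorded
        }
        String attemptId = jvmIDToActiveAttemptMap.remove(jvmId);
        if (attemptId != null) {
            launchedJVMs.remove(jvmId);
        }
        return attemptId;
    }

    boolean isLaunched(String jvmId) {
        return launchedJVMs.contains(jvmId);
    }
}
```

In this sketch, a getTask call that arrives after unregister finds no entry in 
either collection and gets nothing back.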

bq. We went through a couple of iterations on this part of the code, so let us 
make sure things are fine by running the AMScalability benchmark (100K maps) 
once.
I have already run a sort benchmark with the previous patch plus the patch 
from MAPREDUCE-3596, and it passed. I can run AMScalability as well, but this 
issue has never been seen with AMScalability (it shows up primarily when 
shuffle starts and the startContainer calls slow down).

Another change that can be made is to have TaskAttemptListener / 
TaskHeartbeatHandler throw exceptions for calls from unregistered tasks. 
Currently the AM relies on the NM stopContainer call to kill these tasks. I 
will open a separate JIRA for this, and another for the NM startContainer 
calls slowing down. 
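
A rough sketch of what rejecting calls from unregistered tasks could look 
like. Everything here is hypothetical (the class, the exception type and the 
method names are invented for the example and are not the existing 
TaskAttemptListener or TaskHeartbeatHandler API); it only shows the idea of 
failing fast instead of waiting for the NM to stop the container.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical handler that tracks which task attempts are currently registered.
class HeartbeatSketch {
    // Hypothetical exception type signalling a call from an unknown attempt.
    static class UnregisteredTaskException extends RuntimeException {
        UnregisteredTaskException(String attemptId) {
            super("Call from unregistered task attempt: " + attemptId);
        }
    }

    private final Set<String> registeredAttempts = ConcurrentHashMap.newKeySet();

    void register(String attemptId) {
        registeredAttempts.add(attemptId);
    }

    void unregister(String attemptId) {
        registeredAttempts.remove(attemptId);
    }

    // Any call made on behalf of a task (ping, status update, ...) is checked
    // first; unknown attempts get an exception rather than being served.
    void statusUpdate(String attemptId) {
        if (!registeredAttempts.contains(attemptId)) {
            throw new UnregisteredTaskException(attemptId);
        }
        // A real handler would record progress for the attempt here.
    }
}
```

With a check like this, a task that keeps calling in after being unregistered 
sees an error on its next call and can exit on its own.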
                
> Sort job on 350 scale is consistently failing with latest MRV2 code 
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3656
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3656
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3656.txt, MR3656.txt
>
>
> With the code checked out over the last two days, 
> a sort job at 350-node scale with 16800 maps and 680 reduces has been 
> consistently failing for around the last 6 runs.
> When around 50% of the maps are completed, the job suddenly jumps to the 
> failed state.
> On looking at the NM log, found that the RM sent a Stop Container Request to 
> the NM for the AM container.
> But at INFO level, the RM log does not show why the RM is killing the AM when 
> the job is not killed manually.
> One thing found in common in the failed AM logs is -:
> org.apache.hadoop.yarn.state.InvalidStateTransitonException
> with different events and states.
> For example, one log says -:
> {code}
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> TA_UPDATE at ASSIGNED 
> {code}
> Whereas another log says -:
> {code}
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> JOB_COUNTER_UPDATE at ERROR
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
