[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293374#comment-16293374
 ] 

Jim Brennan commented on MAPREDUCE-6948:
----------------------------------------

I have been unable to reproduce this problem in trunk, nor in branch-2.8.   I 
have been able to repro in branch-2.7, but only by adding a sleep to exacerbate 
the race condition.

Analysis:
The key point in the failure case is here:
{{2017-08-30 10:12:22,000 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 1
2017-08-30 10:12:22,029 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 2
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:actOnUnusableNode(1354)) - TaskAttempt killed because it ran on 
unusable node Mock for NodeId, hashCode: 1280187896. 
AttemptId:attempt_1234567890000_0001_m_000000_0
2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
(JobImpl.java:transition(1953)) - Num completed Tasks: 3
}}
At this point Num completed tasks should be 2.  Since it is 3, we start moving 
to the COMMITTED state too early and trip the failure.
In the successful case, the log looks like this:
{{2017-12-15 16:16:54,253 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 1
2017-12-15 16:16:54,258 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,260 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:actOnUnusableNode(1359)) - TaskAttempt killed because it ran on 
unusable node Mock for NodeId, hashCode: 131679889. 
AttemptId:attempt_1234567890000_0001_m_000000_0
2017-12-15 16:16:54,261 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 2
2017-12-15 16:16:54,262 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map 
task task_1234567890000_0001_m_000000
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:checkReadyForCompletionWhenAllReducersDone(2103)) - Killing map 
task task_1234567890000_0001_m_000001
2017-12-15 16:16:54,263 INFO  [Thread-0] impl.JobImpl 
(JobImpl.java:transition(1979)) - Num completed Tasks: 3}}

The second Num Completed Tasks:2 line corresponds to when we mark the Reducer 
task as SUCCEEDED.  At this point, the count of succeeded map tasks should be 
1, because it was just decremented due to the unusable node.  It is incremented 
to 2 before printing.

The difference between branch-2.7, which fails, and trunk/branch-2.8 is the fix 
in MAPREDUCE-6675, which switched it to use a DrainDispatcher and added a 
dispatcher.await() call before we complete the reducer.

Another possible factor is YARN-5436, which fixed a very similar race in 
DrainDispatcher.  That one is present in trunk, but not in branch-2.8.  So it 
may account for intermittent failures in branch-2.8, but I was not able to 
reproduce it.

So as far as I can tell, this appears to be fixed already.

[~haibo.chen], can you provide any insight?  Any chance this failure was seen 
on branch-2.8 or branch-2.7?



> TestJobImpl.testUnusableNodeTransition failed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-6948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Haibo Chen
>            Assignee: Jim Brennan
>              Labels: unit-test
>
> *Error Message*
> expected:<SUCCEEDED> but was:<ERROR>
> *Stacktrace*
> java.lang.AssertionError: expected:<SUCCEEDED> but was:<ERROR>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:743)
>       at org.junit.Assert.assertEquals(Assert.java:118)
>       at org.junit.Assert.assertEquals(Assert.java:144)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.assertJobState(TestJobImpl.java:1041)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.testUnusableNodeTransition(TestJobImpl.java:615)
> *Standard out*
> {code}
> 2017-08-30 10:12:21,928 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventType for class 
> org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler
> 2017-08-30 10:12:21,939 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl$StubbedJob
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.TaskEventType for class 
> org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.jobhistory.EventType for class 
> org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,940 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class 
> org.apache.hadoop.yarn.event.EventHandler$$EnhancerByMockitoWithCGLIB$$79f96ebf
> 2017-08-30 10:12:21,941 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:setup(1534)) - Adding job token for job_1234567890000_0001 to 
> jobTokenSecretManager
> 2017-08-30 10:12:21,941 WARN  [Thread-49] impl.JobImpl 
> (JobImpl.java:setup(1540)) - Shuffle secret key missing from job credentials. 
> Using job token secret as shuffle secret.
> 2017-08-30 10:12:21,944 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:makeUberDecision(1305)) - Not uberizing job_1234567890000_0001 
> because: not enabled;
> 2017-08-30 10:12:21,944 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:createMapTasks(1562)) - Input size for job 
> job_1234567890000_0001 = 0. Number of splits = 2
> 2017-08-30 10:12:21,945 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:createReduceTasks(1579)) - Number of reduces for job 
> job_1234567890000_0001 = 1
> 2017-08-30 10:12:21,945 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from NEW 
> to INITED
> 2017-08-30 10:12:21,946 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from 
> INITED to SETUP
> 2017-08-30 10:12:21,954 INFO  [CommitterEvent Processor #0] 
> commit.CommitterEventHandler (CommitterEventHandler.java:run(231)) - 
> Processing the event EventType: JOB_SETUP
> 2017-08-30 10:12:21,978 INFO  [AsyncDispatcher event handler] impl.JobImpl 
> (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from 
> SETUP to RUNNING
> 2017-08-30 10:12:21,983 INFO  [Thread-49] event.AsyncDispatcher 
> (AsyncDispatcher.java:register(209)) - Registering class 
> org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType for class 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl$5
> 2017-08-30 10:12:22,000 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:transition(1953)) - Num completed Tasks: 1
> 2017-08-30 10:12:22,029 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:transition(1953)) - Num completed Tasks: 2
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:actOnUnusableNode(1354)) - TaskAttempt killed because it ran on 
> unusable node Mock for NodeId, hashCode: 1280187896. 
> AttemptId:attempt_1234567890000_0001_m_000000_0
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:transition(1953)) - Num completed Tasks: 3
> 2017-08-30 10:12:22,032 INFO  [Thread-49] impl.JobImpl 
> (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from 
> RUNNING to COMMITTING
> 2017-08-30 10:12:22,032 ERROR [Thread-49] impl.JobImpl 
> (JobImpl.java:handle(1009)) - Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> JOB_TASK_COMPLETED at COMMITTING
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1007)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TestJobImpl.testUnusableNodeTransition(TestJobImpl.java:609)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2017-08-30 10:12:22,035 INFO  [AsyncDispatcher event handler] impl.JobImpl 
> (JobImpl.java:handle(1017)) - job_1234567890000_0001Job Transitioned from 
> COMMITTING to ERROR
> 2017-08-30 10:12:22,039 INFO  [CommitterEvent Processor #1] 
> commit.CommitterEventHandler (CommitterEventHandler.java:run(231)) - 
> Processing the event EventType: JOB_COMMIT
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to