[ 
https://issues.apache.org/jira/browse/TEZ-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997635#comment-16997635
 ] 

TezQA commented on TEZ-4108:
----------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 
19s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} yetus {color} | {color:red}  0m  7s{color} 
| {color:red} Unprocessed flag(s): --jenkins {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 base: 
https://builds.apache.org/job/PreCommit-TEZ-Build/228/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4108 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12988841/TEZ-4108.001.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-TEZ-Build/228/console |
| versions | git=2.7.4 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |


This message was automatically generated.



> NullPointerException during speculative execution race condition
> ----------------------------------------------------------------
>
>                 Key: TEZ-4108
>                 URL: https://issues.apache.org/jira/browse/TEZ-4108
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Turner Eagles
>            Assignee: Jonathan Turner Eagles
>            Priority: Major
>         Attachments: TEZ-4108.001.patch
>
>
> What is happening is a race condition between speculative execution task 
> launch and failure. Such that when the speculative attempt begins to launch, 
> the earlier attempt is no longer running. Somehow the containerNodeId is 
> becoming null. From the log line we can see that at least when the original 
> task attempt starts, the container node id is not null.
> This points to a bug in Tez.
> Posting AM log context and full stack trace.
> {noformat}
> 2019-11-18 12:58:13,261 [INFO] [Dispatcher thread {Central}] 
> |HistoryEventHandler.criticalEvents|: 
> [HISTORY][DAG:dag_1571959303156_328042_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=Map 1, taskAttemptId=attempt_1571959303156_328042_1_00_091618_0, 
> startTime=1574081893261, 
> containerId=container_e20_1571959303156_328042_01_006583, 
> nodeId=node1.example.com:8041
> 2019-11-18 12:58:25,001 [INFO] [Dispatcher thread {Central}] 
> |HistoryEventHandler.criticalEvents|: 
> [HISTORY][DAG:dag_1571959303156_328042_1][Event:TASK_STARTED]: vertexName=Map 
> 1, taskId=task_1571959303156_328042_1_00_091618, scheduledTime=1574081308038, 
> launchTime=1574081893261
> 2019-11-18 12:58:26,916 [INFO] [Dispatcher thread {Central}] 
> |HistoryEventHandler.criticalEvents|: 
> [HISTORY][DAG:dag_1571959303156_328042_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1571959303156_328042_1_00_091618_0, 
> creationTime=1574081308038, allocationTime=1574081880179, 
> startTime=1574081893261, finishTime=1574081906915, timeTaken=13654, 
> status=KILLED, errorEnum=EXTERNAL_PREEMPTION, diagnostics=Container 
> container_e20_1571959303156_328042_01_006583 finished with diagnostics set to 
> [Container preempted externally. Container Preempted due to node overcommit
> 2019-11-18 12:58:27,264 [INFO] [Dispatcher thread {Speculator}] 
> |legacy.LegacySpeculator|: DefaultSpeculator.addSpeculativeAttempt -- we are 
> speculating task_1571959303156_328042_1_00_091618
> 2019-11-18 12:58:34,633 [INFO] [TaskSchedulerEventHandlerThread] 
> |rm.TaskSchedulerManager|: Task: attempt_1571959303156_328042_1_00_091618_0 
> has no container assignment in the scheduler
> 2019-11-18 12:58:34,636 [ERROR] [TaskSchedulerEventHandlerThread] 
> |rm.TaskSchedulerManager|: No container allocated to task: 
> attempt_1571959303156_328042_1_00_091618_0 according to scheduler. Task 
> reported container id: container_e20_1571959303156_328042_01_006583
> 2019-11-18 12:58:34,866 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
> Scheduling a redundant attempt for task task_1571959303156_328042_1_00_091618
> 2019-11-18 12:58:34,866 [ERROR] [Dispatcher thread {Central}] 
> |impl.TaskImpl|: Uncaught exception when trying handle event 
> T_ADD_SPEC_ATTEMPT at current state RUNNING for task 
> task_1571959303156_328042_1_00_091618
> java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
>         at 
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
>         at java.util.Collections$SetFromMap.add(Collections.java:5461)
>         at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$RedundantScheduleTransition.transition(TaskImpl.java:1044)
>         at 
> org.apache.tez.dag.app.dag.impl.TaskImpl$RedundantScheduleTransition.transition(TaskImpl.java:1032)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:59)
>         at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:841)
>         at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:115)
>         at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2311)
>         at 
> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:2298)
>         at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:180)
>         at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> {code:title=TaskImpl.java}
> 1032   private static class RedundantScheduleTransition
> 1033     implements SingleArcTransition<TaskImpl, TaskEvent> {
> 1034
> 1035     @Override
> 1036     public void transition(TaskImpl task, TaskEvent event) {
> 1037       LOG.info("Scheduling a redundant attempt for task " + task.taskId);
> 1038       
> task.counters.findCounter(TaskCounter.NUM_SPECULATIONS).increment(1);
> 1039       TaskAttempt earliestUnfinishedAttempt = null;
> 1040       for (TaskAttempt ta : task.attempts.values()) {
> 1041         // find the oldest running attempt
> 1042         if (!ta.isFinished()) {
> 1043           earliestUnfinishedAttempt = ta;
> 1044           task.nodesWithRunningAttempts.add(ta.getNodeId());
> 1045         } else {
> 1046           if (TaskAttemptState.SUCCEEDED.equals(ta.getState())) {
> 1047             LOG.info("Ignore speculation scheduling for task {} since it 
> has succeeded with attempt {}.",
> 1048                 task.getTaskId(), ta.getID());
> 1049             return;
> 1050           }
> 1051         }
> 1052       }
> 1053       if (earliestUnfinishedAttempt == null) {
> 1054         // no running (or SUCCEEDED) task attempt at this moment, no 
> need to schedule speculative attempt either
> 1055         LOG.info("Ignore speculation scheduling since there is no 
> running attempt on task {}.", task.getTaskId());
> 1056         return;
> 1057       }
> 1058       if (task.commitAttempt != null) {
> 1059         LOG.info("Ignore speculation scheduling for task {} since commit 
> has started with commitAttempt {}.",
> 1060             task.getTaskId(), task.commitAttempt);
> 1061         return;
> 1062       }
> 1063       task.addAndScheduleAttempt(earliestUnfinishedAttempt.getID());
> 1064     }
> 1065   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to