[ https://issues.apache.org/jira/browse/HIVE-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433507#comment-16433507 ]
Zoltan Haindrich commented on HIVE-19169: ----------------------------------------- to provide better context: I'm working on reoptimizing the queries - so these queries are known to have a bad initial plan (usually map join is used - but its not gonna work; and will eventually get an oom) the problem is that it fails with timeout - in which case there won't be a retry... > llap: Timed out after 90 secs > ----------------------------- > > Key: HIVE-19169 > URL: https://issues.apache.org/jira/browse/HIVE-19169 > Project: Hive > Issue Type: Bug > Components: llap > Reporter: Zoltan Haindrich > Priority: Major > > a more or less recent hive/master with tpcds1000; while running only 1 query > at a time on the cluster; sometimes a "timeout" happens - this could be even > a misconfiguration problem...I'm not sure I've set it up correctly. > what I see from: it seems like the attempt have entered the queue and > accepted - after almost 90 seconds there are messages that it will be > pre-empted - and the 90 sec timeout happens. > hive.log; example: attempt_1522319554594_0065_19_05_000119_14 failed > {code} > 2018-04-10T13:20:25,178 ERROR [HiveServer2-Background-Pool: Thread-4194]: > SessionState (SessionState.java:printError(1214)) - Vertex failed, > vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, > diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, > diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > 2018-04-10T13:20:25,260 ERROR [HiveServer2-Background-Pool: Thread-4194]: > ql.Driver (SessionState.java:printError(1214)) - FAILED: Execution Error, > return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, > vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, > diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, > diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > org.apache.hive.service.cli.HiveSQLException: Error while processing > statement: FAILED: Execution Error, return code 2 from > org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer > 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, > taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 > killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Vertex failed, > vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, > diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, > diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > {code} > tezam logs for the attempt > {code} > 2018-04-10 13:18:45,430 [INFO] [TaskSchedulerEventHandlerThread] > |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. > task=attempt_1522319554594_0065_19_05_000119_14, priority=194, > capability=<memory:3072, vCores:1>, hosts=null > 2018-04-10 13:18:45,440 [INFO] [LlapScheduler] > |tezplugins.LlapTaskSchedulerService|: Assigned #67763, > task=TaskInfo{task=attempt_1522319554594_0065_19_05_000119_14, priority=194, > startTime=0, containerId=null, uniqueId=69780, > localityDelayTimeout=1127144039} on node={lemon-9.openstacklocal:50155, > id=2a3562c1-1094-499f-9d83-5c3b5cf34663, canAcceptTask=true, st=10, ac=1, > commF=false, disabled=false}, to container=container_222212222_0065_01_067763 > 2018-04-10 13:18:45,440 [INFO] [LlapScheduler] |GuaranteedTasks|: Registering > attempt_1522319554594_0065_19_05_000119_14; false > 2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] > |impl.TaskAttemptImpl|: TaskAttempt: > [attempt_1522319554594_0065_19_05_000119_14] submitted. Is using containerId: > [container_222212222_0065_01_067763] on NM: [lemon-9.openstacklocal:50155] > 2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] > |HistoryEventHandler.criticalEvents|: > [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_STARTED]: > vertexName=Reducer 3, > taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, > startTime=1523366325446, containerId=container_222212222_0065_01_067763, > nodeId=lemon-9.openstacklocal:50155 > 2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] > |tezplugins.LlapTaskCommunicator|: Successfully launched task: > attempt_1522319554594_0065_19_05_000119_14 > 2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] |GuaranteedTasks|: > Received response for attempt_1522319554594_0065_19_05_000119_14, true > 2018-04-10 13:19:08,743 [INFO] [IPC Server handler 15 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_05_000128_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > 2018-04-10 13:19:18,728 [INFO] [IPC Server handler 11 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:19:28,726 [INFO] [IPC Server handler 28 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:19:38,726 [INFO] [IPC Server handler 15 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:19:48,726 [INFO] [IPC Server handler 17 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:19:58,726 [INFO] [IPC Server handler 5 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:20:08,726 [INFO] [IPC Server handler 2 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:20:18,727 [INFO] [IPC Server handler 17 on 35408] > |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node > are not there: attempt_1522319554594_0065_19_05_000130_14, > attempt_1522319554594_0065_19_05_000191_14, > attempt_1522319554594_0065_19_08_000018_6, > attempt_1522319554594_0065_19_08_000022_6, > attempt_1522319554594_0065_19_05_000181_14, > attempt_1522319554594_0065_19_05_000131_15, > attempt_1522319554594_0065_19_05_000214_15, > attempt_1522319554594_0065_19_05_000093_11, > attempt_1522319554594_0065_19_05_000119_14, > attempt_1522319554594_0065_19_08_000025_7, > 2018-04-10 13:20:24,594 [INFO] [Dispatcher thread {Central}] > |HistoryEventHandler.criticalEvents|: > [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_FINISHED]: > vertexName=Reducer 3, > taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, > creationTime=1523366325392, allocationTime=1523366325443, > startTime=1523366325446, finishTime=1523366424594, timeTaken=99148, > status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, > diagnostics=AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out > after 90 secs, nodeHttpAddress=http://lemon-9.openstacklocal:15002, > counters=Counters: 0 > 2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] > |GuaranteedTasks|: Unregistering attempt_1522319554594_0065_19_05_000119_14; > false > 2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] > |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for > task=attempt_1522319554594_0065_19_05_000119_14, state=ASSIGNED, > endReason=OTHER > 2018-04-10 13:20:24,603 [INFO] [Dispatcher thread {Central}] > |node.AMNodeImpl|: Attempt failedon node: lemon-9.openstacklocal:50155 TA: > attempt_1522319554594_0065_19_05_000119_14, container: > container_222212222_0065_01_067763, numFailedTAs: 57 > 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] > |impl.VertexImpl|: Source task attempt completed for vertex: > vertex_1522319554594_0065_19_07 [Reducer 10] attempt: > attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: > RUNNING > 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] > |impl.VertexImpl|: Source task attempt completed for vertex: > vertex_1522319554594_0065_19_11 [Reducer 4] attempt: > attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: > RUNNING > 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] > |impl.VertexImpl|: Source task attempt completed for vertex: > vertex_1522319554594_0065_19_06 [Reducer 9] attempt: > attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: > RUNNING > TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs] > 2018-04-10 13:20:25,020 [INFO] [Dispatcher thread {Central}] > |HistoryEventHandler.criticalEvents|: > [HISTORY][DAG:dag_1522319554594_0065_19][Event:VERTEX_FINISHED]: > vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, > initRequestedTime=1523363710631, initedTime=1523363710645, > startRequestedTime=1523363713645, startedTime=1523363713646, > finishTime=1523366425017, timeTaken=2711371, status=FAILED, diagnostics=Task > failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt > 0 killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > 2018-04-10 13:20:25,157 [INFO] [Dispatcher thread {Central}] > |HistoryEventHandler.criticalEvents|: > [HISTORY][DAG:dag_1522319554594_0065_19][Event:DAG_FINISHED]: > dagId=dag_1522319554594_0065_19, startTime=1523363710305, > finishTime=1523366425121, timeTaken=2714816, status=FAILED, > diagnostics=Vertex failed, vertexName=Reducer 3, > vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, > taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 > killed, TaskAttempt 1 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 > secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, > TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt > 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, > TaskAttempt 12 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 > secs], TaskAttempt 13 killed, TaskAttempt 14 failed, > info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 > secs], TaskAttempt 15 failed, > info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): > attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be > removed > {code} > from llap0 logs; filtered for attempt_1522319554594_0065_19_05_000119_14 ; > ordered by time: > {code} > java.lang.RuntimeException: attempt_1522319554594_0065_19_05_000119_14 was > not registered and couldn't be removed > 2018-04-10T13:18:45,455 INFO [IPC Server handler 3 on 50155 ()] > org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: Queueing > container for execution: > fragemendId=attempt_1522319554594_0065_19_05_000119_14, > am_details=lemon-8.openstacklocal:35408, taskInfo= fragment 119 attempt 14, > user=hive, queryId=hive_20180410113951_2fc85570-df7a-4cb1-8499-74d11f227e0d, > appIdString=application_1522319554594_0065, appAttemptNum=1, > containerIdString=container_222212222_0065_01_067763, dagName=with ws as > 2018-04-10T13:18:45,457 INFO [IPC Server handler 3 on 50155 > (1522319554594_0065_19_05_000119_14)] impl.ContainerRunnerImpl: > SubmissionState for attempt_1522319554594_0065_19_05_000119_14 : ACCEPTED > 2018-04-10T13:19:13,533 INFO [Wait-Queue-Scheduler-0 ()] > org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Adding > attempt_1522319554594_0065_19_05_000119_14 to pre-emption queue > 2018-04-10T13:19:13,533 INFO [Wait-Queue-Scheduler-0 ()] > org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Attempting to > execute TaskWrapper{task=attempt_1522319554594_0065_19_05_000119_14, > inWaitQueue=true, inPreemptionQueue=false, registeredForNotifications=true, > canFinish=true, canFinish(in queue)=true, isGuaranteed=false, > firstAttemptStartTime=1523364757720, dagStartTime=1523363710305, > withinDagPriority=194, vertexParallelism= 297, selfAndUpstreamParallelism= > 1286, selfAndUpstreamComplete= 1243} > 2018-04-10T13:19:13,534 WARN [ExecutionCompletionThread #0 ()] > org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Received > onSuccess/onFailure for a fragment for which a completing message was not > received: attempt_1522319554594_0065_19_05_000119_14 > 2018-04-10T13:19:13,536 ERROR [ExecutionCompletionThread #0 ()] > org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Failed > notification received: Stacktrace: java.lang.RuntimeException: > attempt_1522319554594_0065_19_05_000119_14 was not registered and couldn't be > removed > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)