[jira] [Commented] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

Siddharth Seth (JIRA) Fri, 24 Apr 2015 14:04:53 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511738#comment-14511738
 ]


Siddharth Seth commented on HIVE-10480:
---------------------------------------

This is what is happening here.

The heartbeat being sent out for a task (from the LLAP daemon) to the AM is 
corrupt - TEZ-2367.
This causes an error to be reported and an Interrupt on the task.
The NPE and Ignoring exception can be ignored - that's caused by the task being 
unregistered as Prasanth pointed out. It's not the root cause of failure, and 
logging it always causes confusion. The log line has already been pruned in Tez 
(yesterday).

Since the daemon considers the task to be dead - it won't send another 
heartbeat to the AM.
The AM has no idea that the task is dead - since the last heartbeat was 
corrupt. The regular timeout mechanism kicks in, and the task is considered 
dead after 5 minutes (the default timeout). 
A new attempt of the same task is setup and runs to completion.

> LLAP: Tez task is interrupted for unknown reason after an IPC exception and 
> then fails to report completion
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-10480
>                 URL: https://issues.apache.org/jira/browse/HIVE-10480
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>
> No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the 
> cluster), or all 3.
> So for now I will just dump all I have here.
> TPCH Q1 started running for a long time for me on large number of runs today 
> (didn't happen yesterday). It would always be one Map task timing out.
>  Example attempt (logs from am):
> {noformat}
> 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] 
> tezplugins.LlapTaskCommunicator: Successfully launched task: 
> attempt_1429683757595_0321_9_00_000928_0
> 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, 
> startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, 
> status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, 
> diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out 
> after 300 secs, counters=Counters: 1, 
> org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1
> {noformat}
> No other lines for this attempt in between.
> However there's this:
> {noformat}
> 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: 
> Unable to read call parameters for client 172.19.128.56on connection protocol 
> org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind 
> RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException
> 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: 
> Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 
> threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable 
> to read call parameters: null]
> {noformat}
> On LLAP, the following is logged 
> {noformat}
> 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR 
> org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException):
>  IPC server unable to read call parameters: null
>         at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1423)
>         at 
> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)
>         at com.sun.proxy.$Proxy19.heartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258)
>         at 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186)
>         at 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The attempt starts but is then interrupted (not clear by whom)
> {noformat}
> 2015-04-24 11:11:01,144 [Initializer 
> 0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
>  1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: 
> Initialized Input with src edge: lineitem
> 2015-04-24 11:11:01,145 
> [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
>  1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error 
> while executing task: attempt_1429683757595_0321_9_00_000928_0
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
>         at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>         at 
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>         at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:218)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:177)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
> {noformat}
> There's a heartbeat error that may or may not be spurious:
> {noformat}
> 2015-04-24 11:11:01,146 
> [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map
>  1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the 
> following exception since a previous exception is already registered
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.access$300(LlapTaskReporter.java:129)
>         at 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter.taskFailed(LlapTaskReporter.java:394)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:261)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:52)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:228)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
> {noformat}
> And nothing more for this task.
> Task succeeds on rerun. Other tasks also succeed on this node meanwhile.
> 1) Is it Hadoop IPC bug due to new patch?
> 2) Even if so (assuming IPC is not totally broken) I wonder if the Heartbeat 
> NPE (which we have seen before, without the IPC patch too IIRC) is actually a 
> real issue that prevents the task from being sent to AM?
> 3) Who interrupts the task and why? AM doesn't have any logs about that and 
> it happens after RPC error, not before.
> Btw, there's another flavor or RPC error:
> {noformat}
> 2015-04-24 10:36:30,183 INFO [Socket Reader #1 for port 59446] ipc.Server: 
> Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 
> threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable 
> to read call parameters: 1382376565]
> {noformat}
> application_1429683757595_0320, application_1429683757595_0321



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-10480) LLAP: Tez task is interrupted for unknown reason after an IPC exception and then fails to report completion

Reply via email to