[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225125#comment-15225125 ]
Kurt Muehlner commented on TEZ-3187:
------------------------------------
I'm working on getting the plans for those vertices. In parallel I've been
looking into the most recent application hang, with some interesting results:
I'm confident that the config changes suggested as a workaround by [~daijy]
have improved stability. However, we did have one application hang. Unlike in
any previous hang, the application recovered after a few hours. Here's a
timeline of interesting events. Please let me know what, if any, logs might be
of interest.
{code}
Timeline:
00:16:06 AM begins processing a DAG.
00:16:27 AM event queue begins to grow rapidly:
2016-04-02 00:16:27,118 [INFO] [IPC Server handler 11 on 56356]
|common.AsyncDispatcher|: Size of event-queue is 1000
2016-04-02 00:16:28,456 [INFO] [IPC Server handler 25 on 56356]
|common.AsyncDispatcher|: Size of event-queue is 2000
2016-04-02 00:16:29,797 [INFO] [IPC Server handler 11 on 56356]
|common.AsyncDispatcher|: Size of event-queue is 3000
2016-04-02 00:16:31,151 [INFO] [IPC Server handler 7 on 56356]
|common.AsyncDispatcher|: Size of event-queue is 4000
00:39:52 event queue grows to size 827000. No more logging about growth
occurs until 01:19:30
00:41:20 task attempts begin to fail with a SocketTimeoutException when
attempting to connect to AM:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read
01:44:52 last log message for event queue growth in AM:
2016-04-02 01:44:52,569 [INFO] [IPC Server handler 7 on 56356]
|common.AsyncDispatcher|: Size of event-queue is 831000
0:3:34 out of memory error causes AM to exit on a HaltException. This triggers
the dag to be retried, which completes successfully.
{code}
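To put a number on the queue growth in the timeline, the AsyncDispatcher lines can be parsed and the net growth between the first and last sample computed. This is a minimal sketch, not anything taken from Tez itself. It assumes the wrapped log lines have been rejoined so that each entry sits on a single line, and the names `QUEUE_LINE`, `parse_queue_sizes`, and `growth_rate` are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime

# Matches AsyncDispatcher entries of the form seen in the timeline, e.g.
# "2016-04-02 00:16:27,118 [INFO] [...] |common.AsyncDispatcher|: Size of event-queue is 1000"
# Assumption: each log entry is on one line (email wrapping undone).
QUEUE_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"\|common\.AsyncDispatcher\|: Size of event-queue is (\d+)"
)


def parse_queue_sizes(lines):
    """Return a list of (timestamp, queue_size) tuples from AM log lines."""
    samples = []
    for line in lines:
        match = QUEUE_LINE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            samples.append((ts, int(match.group(2))))
    return samples


def growth_rate(samples):
    """Net queue growth in events per second between first and last sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    return (n1 - n0) / (t1 - t0).total_seconds()


# The four entries quoted in the timeline, rejoined onto single lines.
SAMPLE = """\
2016-04-02 00:16:27,118 [INFO] [IPC Server handler 11 on 56356] |common.AsyncDispatcher|: Size of event-queue is 1000
2016-04-02 00:16:28,456 [INFO] [IPC Server handler 25 on 56356] |common.AsyncDispatcher|: Size of event-queue is 2000
2016-04-02 00:16:29,797 [INFO] [IPC Server handler 11 on 56356] |common.AsyncDispatcher|: Size of event-queue is 3000
2016-04-02 00:16:31,151 [INFO] [IPC Server handler 7 on 56356] |common.AsyncDispatcher|: Size of event-queue is 4000
"""

samples = parse_queue_sizes(SAMPLE.splitlines())
print(f"{len(samples)} samples, net growth {growth_rate(samples):.0f} events/s")
```

For the four entries above, the queue grows by 3000 events in about 4 seconds. This is net growth (events enqueued minus events dispatched), so it only bounds the enqueue rate from below.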
Suggestions for what to investigate and possible workarounds welcome!
> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
> Key: TEZ-3187
> URL: https://issues.apache.org/jira/browse/TEZ-3187
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2
> Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
> Reporter: Kurt Muehlner
> Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt,
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out,
> syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz
>
>
> We are experiencing occasional application hangs when testing an existing
> Pig MapReduce script executing on Tez. When this occurs, we find this in
> the syslog for the executing DAG:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000822,
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000824,
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324]
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
> at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
> at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
> at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000811,
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a
> warning in the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}]
> |retry.RetryInvocationHandler|: A failover has occurred since the start of
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the
> IOException
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)