[
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210965#comment-15210965
]
Hitesh Shah commented on TEZ-3187:
----------------------------------
bq. For my own education, how are you concluding that the tasks you identify
are incomplete?
This is mainly based on the history data. There are multiple ways to look at
this:
1) The best way is the Tez UI, if you have ATS enabled in YARN.
2) Look for [HISTORY] in the AM logs. These lines record every dag, vertex,
task, and task attempt state change. In this case I pulled out all
TASK_ATTEMPT_STARTED and TASK_ATTEMPT_FINISHED events and joined the two
sets on attempt ID to see which attempts were started but never completed
(a rough sketch of that join is below).
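For reference, a minimal sketch of that join (a hypothetical helper of my
own, not a Tez tool; it assumes each [HISTORY] line carries the event name
and the attempt ID in plain text, and the exact field layout may differ
across Tez versions):
{code:python}
#!/usr/bin/env python
# find_incomplete_attempts.py (hypothetical name): report task attempts that
# logged TASK_ATTEMPT_STARTED but no matching TASK_ATTEMPT_FINISHED.
import re
import sys

# attempt_<clusterTs>_<appId>_<dagIdx>_<vertexIdx>_<taskIdx>_<attemptIdx>
ATTEMPT_ID = re.compile(r"attempt_\d+_\d+_\d+_\d+_\d+_\d+")

def incomplete_attempts(am_log_path):
    started, finished = set(), set()
    with open(am_log_path) as log:
        for line in log:
            if "[HISTORY]" not in line:
                continue
            match = ATTEMPT_ID.search(line)
            if match is None:
                continue
            if "TASK_ATTEMPT_STARTED" in line:
                started.add(match.group(0))
            elif "TASK_ATTEMPT_FINISHED" in line:
                finished.add(match.group(0))
    # the "join": attempts that started but never reported a finish
    return sorted(started - finished)

if __name__ == "__main__":
    for attempt in incomplete_attempts(sys.argv[1]):
        print(attempt)
{code}
Running that against the attached dag syslog (gunzipped) should flag
attempt_1437886552023_169758_3_08_000043_0 along with the rest of
TEZ-3187.incomplete-tasks.txt.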
Interestingly, in this case we don't see a TASK_ATTEMPT_FINISHED in the AM
logs for attempt_1437886552023_169758_3_08_000043_0 even though the
attempt's own logs show it as done.
Would it be possible for you to sanitize the logs of container
"container_e11_1437886552023_169758_01_000806" and attach them here? It would
be useful to see whether the container reported any errors when trying to
talk to the AM. You can use
"https://github.com/hiteshs/dev-tools/blob/master/hadoop-tools/yarn/yarn_app_logs_splitter.py"
to split the aggregated yarn logs into separate per-container files.
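(In essence the splitter boils down to something like the sketch below; this
is an approximation rather than the script's actual code, and the
"Container:" header it keys on is an assumption about the aggregated output
of "yarn logs -applicationId <appId>".)
{code:python}
# Rough stand-in for yarn_app_logs_splitter.py: write one file per container
# from an aggregated YARN log, splitting on the per-container header lines
# ("Container: container_... on <node>").
import re
import sys

CONTAINER_HEADER = re.compile(r"^Container: (container_\S+)")

def split_by_container(aggregated_log_path, out_dir="."):
    current = None
    with open(aggregated_log_path) as agg:
        for line in agg:
            match = CONTAINER_HEADER.match(line)
            if match:
                if current:
                    current.close()
                current = open("%s/%s.log" % (out_dir, match.group(1)), "w")
            if current:  # lines before the first header are skipped
                current.write(line)
    if current:
        current.close()

if __name__ == "__main__":
    split_by_container(sys.argv[1])
{code}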
\cc [~rajesh.balamohan] [~sseth] in case they have seen this with any runs of
the latest branch with Hive-LLAP.
> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
> Key: TEZ-3187
> URL: https://issues.apache.org/jira/browse/TEZ-3187
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2
> Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
> Reporter: Kurt Muehlner
> Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt,
> dag_1437886552023_169758_3.dot, syslog_dag_1437886552023_169758_3.gz
>
>
> We are experiencing occasional application hangs when testing an existing
> Pig MapReduce script executing on Tez. When this occurs, we find the
> following in the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000822,
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000824,
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324]
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
> at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
> at
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
> at
> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000811,
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a
> warning in the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}]
> |retry.RetryInvocationHandler|: A failover has occurred since the start of
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the
> IOException