[ 
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218442#comment-15218442
 ] 

Hitesh Shah commented on TEZ-3187:
----------------------------------

[~kmuehlner] Thanks for digging into this. The way Tez works is via "events" - 
one example of an event is how the map output informs the reducer input that 
data is ready to be consumed from a particular location. The shuffle manager 
hanging implies it has not yet received events from all of the input sources 
it expects to receive data from.
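
To illustrate the failure mode (a minimal, self-contained sketch, not the real 
Tez classes): the consumer side effectively blocks until it has recorded a 
"data ready" event from every expected source, so a single lost event is enough 
to make the downstream task wait forever.

{code:java}
import java.util.BitSet;

// Sketch of the waiting behaviour only; names and structure are illustrative.
public class ShuffleWaitSketch {
    private final int numExpectedSources;           // one per upstream source task
    private final BitSet receivedFrom = new BitSet();

    public ShuffleWaitSketch(int numExpectedSources) {
        this.numExpectedSources = numExpectedSources;
    }

    // Invoked when a data-movement style event arrives for a given source index.
    public synchronized void onSourceReady(int sourceIndex) {
        receivedFrom.set(sourceIndex);
        notifyAll();
    }

    // The consumer blocks here; if even one upstream event never arrives,
    // this loop never exits and the task appears hung.
    public synchronized void awaitAllSources() throws InterruptedException {
        while (receivedFrom.cardinality() < numExpectedSources) {
            wait();
        }
    }
}
{code}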

[~rajesh.balamohan] Can you take a look at this? Or if you have any scripts 
that can help Kurt track down which source task's events have not reached the 
downstream waiting task, that would be helpful too.
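
As a rough starting point until such a script exists (this is a sketch, not an 
existing Tez tool; the hard part - extracting the source task attempt IDs from 
the AM/task logs - is what a Tez-side script would need to supply), diffing the 
set of sources that emitted events against the set the hung task actually 
received would point at the missing sender(s):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// args[0]: file with one source attempt ID per line (events emitted upstream)
// args[1]: file with one source attempt ID per line (events received by the hung task)
public class MissingEventDiff {
    public static void main(String[] args) throws IOException {
        Set<String> emitted  = new HashSet<>(Files.readAllLines(Paths.get(args[0])));
        Set<String> received = new HashSet<>(Files.readAllLines(Paths.get(args[1])));

        // Whatever was emitted but never received is the candidate for the lost event.
        emitted.removeAll(received);
        System.out.println("Sources whose events never reached the waiting task:");
        emitted.forEach(System.out::println);
    }
}
{code}

Run as: java MissingEventDiff emitted.txt received.txt (one attempt ID per line 
in each file).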

\cc [~daijy] [~rohini] from the Pig team if they have come across this.  

> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
>                 Key: TEZ-3187
>                 URL: https://issues.apache.org/jira/browse/TEZ-3187
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>            Reporter: Kurt Muehlner
>         Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, 
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, 
> syslog_dag_1437886552023_169758_3.gz
>
>
> We are experiencing occasional application hangs, when testing an existing 
> Pig MapReduce script, executing on Tez.  When this occurs, we find this in 
> the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000822, 
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000824, 
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>         at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>         at 
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>         at 
> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000811, 
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a 
> warning on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
> |retry.RetryInvocationHandler|: A failover has occurred since the start of 
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred 
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the 
> IOException



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
