[
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219108#comment-15219108
]
Rajesh Balamohan commented on TEZ-3187:
---------------------------------------
scope_6778 --> scope_6780, scope_6785
scope_6770 --> scope_6778
scope_6777 --> scope_6778
scope_6771 --> scope_6778
75 tasks are scheduled for scope_6778. Of these, 53 tasks in scope_6778 are
still running, which blocks downstream vertices scope_6780 and scope_6785 from
proceeding further.
[~kmuehlner] Is it possible to attach the task logs for any of the following
tasks? That would help in understanding what these tasks were doing (a sample
command for pulling these from the aggregated YARN logs follows the list).
{noformat}
task_1437886552023_169758_3_08_000000 (search for syslog_attempt_1437886552023_169758_3_08_000000 in yarn logs)
task_1437886552023_169758_3_08_000002 (search for syslog_attempt_1437886552023_169758_3_08_000002 in yarn logs)
task_1437886552023_169758_3_08_000005
task_1437886552023_169758_3_08_000006
task_1437886552023_169758_3_08_000007
task_1437886552023_169758_3_08_000008
task_1437886552023_169758_3_08_000010
task_1437886552023_169758_3_08_000011
task_1437886552023_169758_3_08_000012
task_1437886552023_169758_3_08_000013
task_1437886552023_169758_3_08_000014
task_1437886552023_169758_3_08_000015
task_1437886552023_169758_3_08_000017
task_1437886552023_169758_3_08_000018
task_1437886552023_169758_3_08_000019
task_1437886552023_169758_3_08_000020
task_1437886552023_169758_3_08_000023
task_1437886552023_169758_3_08_000024
task_1437886552023_169758_3_08_000025
task_1437886552023_169758_3_08_000027
{noformat}
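As a possible starting point (assuming the application id is
application_1437886552023_169758, inferred from the task ids above), the
relevant attempt syslogs could be pulled from the aggregated YARN logs with
something like:
{noformat}
# Sketch only: fetch the aggregated logs for the application and pull out
# the syslog of one suspect attempt (adjust the attempt id and the amount
# of grep context as needed)
yarn logs -applicationId application_1437886552023_169758 \
  | grep -A 200 "syslog_attempt_1437886552023_169758_3_08_000000"
{noformat}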
> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
> Key: TEZ-3187
> URL: https://issues.apache.org/jira/browse/TEZ-3187
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2
> Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
> Reporter: Kurt Muehlner
> Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt,
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out,
> syslog_dag_1437886552023_169758_3.gz
>
>
> We are experiencing occasional application hangs when testing an existing
> Pig MapReduce script executing on Tez. When this occurs, we find the
> following in the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000822,
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000824,
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324]
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
> at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
> at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
> at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager]
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout
> delay expired or is new. Releasing container,
> containerId=container_e11_1437886552023_169758_01_000811,
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a
> warning on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}]
> |retry.RetryInvocationHandler|: A failover has occurred since the start of
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the
> IOException