[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210757#comment-15210757 ]

Kurt Muehlner commented on TEZ-3187:
------------------------------------

I'll get the .dot file uploaded shortly.  Full application logs are a lot 
harder, as I'm not able to just pull them from 'yarn logs' and release them.  
If possible, I'd prefer to provide the individual logs that seem necessary 
to track down the root cause.

For my own education, how are you concluding that the tasks you identify are 
incomplete?  Picking one of them, attempt_1437886552023_169758_3_08_000043_0, 
which ran on 10.102.173.86 (prod015), that task attempt appears to have 
completed.  In the stdout on that host I see:

2016-03-21 16:38:53 Starting to run new task attempt: 
attempt_1437886552023_169758_3_08_000043_0
2016-03-21 16:38:54 Completed running task attempt: 
attempt_1437886552023_169758_3_08_000043_0

And in the log from the TezChild:
2016-03-21 16:38:54,956 [INFO] [TezChild] |task.TaskRunner2Callable|: Cleaning 
up task attempt_1437886552023_169758_3_08_000043_0, stopRequested=false
2016-03-21 16:38:54,957 [INFO] [TezChild] 
|runtime.LogicalIOProcessorRuntimeTask|: Final Counters for 
attempt_1437886552023_169758_3_08_000043_0: Counters: 48 [[File System Counters 
FILE_BYTES_READ=1915463, FILE_BYTES_WRITTEN=5178225, FILE_READ_OPS=0, 
FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, 
HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, 
HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter 
REDUCE_INPUT_GROUPS=1243, REDUCE_INPUT_RECORDS=1243, COMBINE_INPUT_RECORDS=0, 
SPILLED_RECORDS=1243, NUM_SHUFFLED_INPUTS=76, NUM_SKIPPED_INPUTS=149, 
NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=76, GC_TIME_MILLIS=0, 
CPU_MILLISECONDS=2910, PHYSICAL_MEMORY_BYTES=1607991296, 
VIRTUAL_MEMORY_BYTES=4007055360, COMMITTED_HEAP_BYTES=1607991296, 
OUTPUT_RECORDS=2458, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=9153916, 
OUTPUT_BYTES_WITH_OVERHEAD=9163762, OUTPUT_BYTES_PHYSICAL=3464778, 
ADDITIONAL_SPILLS_BYTES_WRITTEN=1713431, ADDITIONAL_SPILLS_BYTES_READ=1746947, 
ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=1778766, 
SHUFFLE_BYTES_DECOMPRESSED=4535286, SHUFFLE_BYTES_TO_MEM=1745250, 
SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=33516, 
NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=558, 
MERGE_PHASE_TIME=627, FIRST_EVENT_RECEIVED=127, 
LAST_EVENT_RECEIVED=511][Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, 
WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0]]
2016-03-21 16:38:54,960 [INFO] [main] |task.TezTaskRunner2|: TaskRunnerResult 
for attempt_1437886552023_169758_3_08_000043_0 : 
TaskRunner2Result{endReason=SUCCESS, error=null, 
containerShutdownRequested=false}
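
For reference, here is roughly how I'm spot-checking individual attempts 
against the aggregated logs (a rough sketch, assuming log aggregation is 
enabled; the application ID below is just the one inferred from the attempt 
ID):

# Pull the aggregated logs and look for the completion markers for one attempt
yarn logs -applicationId application_1437886552023_169758 \
  | grep 'attempt_1437886552023_169758_3_08_000043_0' \
  | grep -E 'Completed running task attempt|TaskRunner2Result'

If both the 'Completed running task attempt' line and a TaskRunner2Result with 
endReason=SUCCESS show up, I'm treating that attempt as completed on its host.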




> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
>                 Key: TEZ-3187
>                 URL: https://issues.apache.org/jira/browse/TEZ-3187
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>            Reporter: Kurt Muehlner
>         Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, 
> syslog_dag_1437886552023_169758_3.gz
>
>
> We are experiencing occasional application hangs, when testing an existing 
> Pig MapReduce script, executing on Tez.  When this occurs, we find this in 
> the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000822, 
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000824, 
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>         at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>         at 
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>         at 
> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000811, 
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a 
> warning on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
> |retry.RetryInvocationHandler|: A failover has occurred since the start of 
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred 
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the 
> IOException



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
