[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212648#comment-15212648 ]

Kurt Muehlner commented on TEZ-3187:
------------------------------------

In addition to the possible issue of communication with the standby namenode, we 
appear to have tasks that are neither making progress nor reporting the lack 
of progress to the AM.  When I killed the application, the resulting interrupt 
produced this stack trace in a task attempt:

{code}
2016-03-25 10:18:33,590 [WARN] [TezChild] |readers.UnorderedKVReader|: Interrupted while waiting for next available input
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
        at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:489)
        at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:678)
        at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.getNextInput(ShuffleManager.java:857)
        at org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput(UnorderedKVReader.java:188)
        at org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next(UnorderedKVReader.java:122)
        at org.apache.tez.runtime.library.input.ConcatenatedMergedKeyValueInput$ConcatenatedMergedKeyValueReader.next(ConcatenatedMergedKeyValueInput.java:52)
        at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POValueInputTez.getNextTuple(POValueInputTez.java:124)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:119)
        at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:319)
        at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:196)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:351)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:59)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:59)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:36)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{code}
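
For illustration only, here is a toy Java sketch of the pattern the trace shows 
(this is not Tez source; the class and names are made up).  A consumer parked in 
LinkedBlockingDeque.take() blocks silently until input arrives or the thread is 
interrupted, whereas polling with a timeout would give the task a periodic point 
at which it could report that it is still waiting:

{code}
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Toy sketch, not Tez code: stands in for a task thread waiting on the
// ShuffleManager's completed-input queue.
public class StuckReaderSketch {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingDeque<String> inputs = new LinkedBlockingDeque<>();

        // Simulated producer: delivers one input after 12 seconds,
        // standing in for a fetcher that is slow (or wedged).
        new Thread(() -> {
            try {
                TimeUnit.SECONDS.sleep(12);
                inputs.offer("input-0");
            } catch (InterruptedException ignored) {
            }
        }).start();

        // The trace above shows the task parked in inputs.take(), which
        // blocks until input arrives or the thread is interrupted (as it
        // was when I killed the application).  Polling with a timeout
        // instead creates a periodic point where a real task could report
        // liveness/progress to the AM:
        String next = null;
        while (next == null) {
            next = inputs.poll(5, TimeUnit.SECONDS);
            if (next == null) {
                System.out.println("still waiting for next available input...");
            }
        }
        System.out.println("got " + next);
    }
}
{code}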

This led me to TEZ-808 and the new config param in 0.8.2, 
'tez.task.progress.stuck.interval-ms', which sounded promising.  Unfortunately, 
when I set that parameter, all tasks fail immediately, regardless of the 
timeout chosen.  This may be because Pig does not call into the progress API, 
as tracked in PIG-4700.  I'd love to get some insight into that.
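
For concreteness, this is the kind of tez-site.xml entry I mean (the property 
name is from TEZ-808 / 0.8.2; the 300000 ms threshold is just an example value, 
not a recommendation):

{code}
<!-- Illustrative only: enable the stuck-task check added by TEZ-808.
     A task attempt reporting no progress for this many milliseconds
     is considered stuck. -->
<property>
  <name>tez.task.progress.stuck.interval-ms</name>
  <value>300000</value>
</property>
{code}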

Of course this doesn't get to the root cause of why the task was stuck, but the 
ability of the AM to kill and then retry such stuck tasks in Pig on Tez would 
be a big help.

> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
>                 Key: TEZ-3187
>                 URL: https://issues.apache.org/jira/browse/TEZ-3187
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>            Reporter: Kurt Muehlner
>         Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, 
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, 
> syslog_dag_1437886552023_169758_3.gz
>
>
> We are experiencing occasional application hangs when testing an existing 
> Pig MapReduce script executing on Tez.  When this occurs, we find the 
> following in the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>         at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a 
> warning on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred 
> (the most recent failover we see in the logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the IOException


