Prabhu Joseph created TEZ-3968:
----------------------------------

             Summary: Tez Job Fails with Shuffle failures without rerunning the producer MapTask
                 Key: TEZ-3968
                 URL: https://issues.apache.org/jira/browse/TEZ-3968
             Project: Apache Tez
          Issue Type: Improvement
    Affects Versions: 0.7.1
            Reporter: Prabhu Joseph


A Tez job failed because a reduce task failed on all four of its attempts while fetching a particular map output from a node. The NodeManager on which the MapTask had succeeded was stopped, had its NM local directories cleared (the disks were full), and was then started again. This caused the shuffle failure on that NodeManager, since no job token could be found for the job.

NodeManager logs show the reason for the shuffle failure:
{code}
2018-07-05 00:26:00,371 WARN  mapred.ShuffleHandler (ShuffleHandler.java:messageReceived(947)) - Shuffle failure
org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job job_1530690553693_17267 !!
        at org.apache.hadoop.mapreduce.security.token.JobTokenSecretManager.retrieveTokenSecret(JobTokenSecretManager.java:112)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:1133)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:944)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:142)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
        at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:148)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
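
The missing token is the key point in the stack trace above: the ShuffleHandler can only serve map output for jobs whose shuffle secret it still holds, and that state was lost when the NM local directories were wiped before the restart. A minimal, self-contained sketch of that lookup (hypothetical class and method names, not the actual Hadoop code):

{code}
// Hypothetical, simplified illustration of why the fetch fails after the NM
// restart: per-job shuffle secrets live in memory, populated when the job is
// registered. Clearing the NM local/recovery state and restarting loses them,
// so the next fetch for an older job fails exactly like the log above.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShuffleTokenStoreSketch {

    static class InvalidTokenException extends Exception {
        InvalidTokenException(String msg) { super(msg); }
    }

    // Lost whenever the NodeManager restarts without its recovery state.
    private final Map<String, byte[]> jobTokens = new ConcurrentHashMap<>();

    void registerJob(String jobId, byte[] shuffleSecret) {
        jobTokens.put(jobId, shuffleSecret);
    }

    byte[] retrieveTokenSecret(String jobId) throws InvalidTokenException {
        byte[] secret = jobTokens.get(jobId);
        if (secret == null) {
            // Corresponds to the "Can't find job token for job ... !!" failure.
            throw new InvalidTokenException("Can't find job token for job " + jobId);
        }
        return secret;
    }
}
{code}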


Analysis of Application Logs:

Application application_1530690553693_17267 failed because task task_1530690553693_17267_4_02_000496 failed on all four attempts.

Four Attempts:

{code}
attempt_1530690553693_17267_4_02_000496_3 -> container_e270_1530690553693_17267_01_014554 -> bigdata2.openstacklocal
attempt_1530690553693_17267_4_02_000496_2 -> container_e270_1530690553693_17267_01_014423 -> bigdata3.openstacklocal
attempt_1530690553693_17267_4_02_000496_1 -> container_e270_1530690553693_17267_01_014311 -> bigdata4.openstacklocal
attempt_1530690553693_17267_4_02_000496_0 -> container_e270_1530690553693_17267_01_014613 -> bigdata5.openstacklocal
{code}


All four attempts failed while fetching the same map output:

{code}
2018-07-05 00:26:54,161 [WARN] [fetcher {Map_1} #51] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to bigdata6.openstacklocal:13562 with 1 inputs pending
java.io.IOException: Server returned HTTP response code: 401 for URL: http://bigdata6.openstacklocal:13562/mapOutput?job=job_1530690553693_17267&reduce=496&map=attempt_1530690553693_17267_4_01_000874_0_10003
{code}

The failures are correctly reported back to the AM in Tez, but they are not treated as "source unhealthy", because the NodeManager itself is healthy again after the cleanup and restart.

{code}
2018-07-04 23:47:42,344 [INFO] [fetcher {Map_1} #10] |orderedgrouped.ShuffleScheduler|: Map_1: Reporting fetch failure for InputIdentifier: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=874], attemptNumber=0, pathComponent=ttempt_1530690553693_17267_4_01_000874_0_10003, spillType=0, spillId=-1] taskAttemptIdentifier: Map 1_000874_00 to AM.
{code}

Approximately 460 errors like this were reported back to the AM, and they keep being attributed to the fetcher ("fetcher unhealthy") rather than to the source, probably because the restarted NM shows up as healthy.

This scenario of shuffle failures is not handled, because the NM shows up as healthy. The mapper (the source InputIdentifier) has to be marked as unhealthy and rerun.
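
A rough sketch of the kind of handling being asked for (illustrative only, not the actual Tez ShuffleScheduler code; the class, method names, and threshold are hypothetical): once fetch failures for a given source attempt cross a threshold, declare the source attempt failed so the producer task is rerun, regardless of whether the hosting NM still reports healthy.

{code}
// Hypothetical sketch of per-source fetch-failure accounting. Names and the
// threshold are illustrative; the real Tez ShuffleScheduler differs.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SourceFailureTrackerSketch {

    // Assumed: after this many fetch failures against one source attempt,
    // stop blaming the fetcher and rerun the producer instead.
    private static final int MAX_FETCH_FAILURES_PER_SOURCE = 4;

    private final Map<String, Integer> failuresPerSourceAttempt = new ConcurrentHashMap<>();

    /**
     * Called for every fetch failure reported by a reducer.
     *
     * @param sourceAttemptId producer (map) attempt whose output could not be fetched
     * @param nodeHealthy     whether the NM hosting the output currently reports healthy
     * @return true if the producer attempt should be declared failed and rerun
     */
    boolean onFetchFailure(String sourceAttemptId, boolean nodeHealthy) {
        int failures = failuresPerSourceAttempt.merge(sourceAttemptId, 1, Integer::sum);
        // Key point of the proposal: nodeHealthy is deliberately not consulted.
        // A healthy-looking NM can still have lost the map output and job token
        // (e.g. local dirs cleared before restart), so repeated failures against
        // the same source must be enough on their own to trigger a rerun.
        return failures >= MAX_FETCH_FAILURES_PER_SOURCE;
    }
}
{code}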




