[ https://issues.apache.org/jira/browse/TEZ-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kuhu Shukla resolved TEZ-3968.
------------------------------
    Resolution: Duplicate

This is similar to TEZ-3910. Closing this as a duplicate. Thank you [~Prabhu Joseph] for the report.

> Tez Job Fails with Shuffle failures too fast when NM returns a 401 error
> ------------------------------------------------------------------------
>
>                 Key: TEZ-3968
>                 URL: https://issues.apache.org/jira/browse/TEZ-3968
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.7.1
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> The Tez job failed because a reduce task failed on all four attempts while fetching a particular map output from one node. The NodeManager where the map task had succeeded was stopped, had its NM local directories cleared (as the disks were full), and was started again. This caused the shuffle failures in the NodeManager, since no job token could be found.
>
> The NodeManager logs show the reason for the shuffle failure:
> {code}
> 2018-07-05 00:26:00,371 WARN mapred.ShuffleHandler (ShuffleHandler.java:messageReceived(947)) - Shuffle failure
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job job_1530690553693_17267 !!
>         at org.apache.hadoop.mapreduce.security.token.JobTokenSecretManager.retrieveTokenSecret(JobTokenSecretManager.java:112)
>         at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:1133)
>         at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:944)
> {code}
> Analysis of the application logs: application application_1530690553693_17267 failed because task task_1530690553693_17267_4_02_000496 failed on all four attempts.
> The four attempts:
> {code}
> attempt_1530690553693_17267_4_02_000496_3 -> container_e270_1530690553693_17267_01_014554 -> bigdata2.openstacklocal
> attempt_1530690553693_17267_4_02_000496_2 -> container_e270_1530690553693_17267_01_014423 -> bigdata3.openstacklocal
> attempt_1530690553693_17267_4_02_000496_1 -> container_e270_1530690553693_17267_01_014311 -> bigdata4.openstacklocal
> attempt_1530690553693_17267_4_02_000496_0 -> container_e270_1530690553693_17267_01_014613 -> bigdata5.openstacklocal
> {code}
> All four attempts failed while fetching the same map output:
> {code}
> 2018-07-05 00:26:54,161 [WARN] [fetcher {Map_1} #51] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to bigdata6.openstacklocal:13562 with 1 inputs pending
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://bigdata6.openstacklocal:13562/mapOutput?job=job_1530690553693_17267&reduce=496&map=attempt_1530690553693_17267_4_01_000874_0_10003
> {code}
> The failures are reported back to the AM correctly in Tez, but they are not reported as "source unhealthy" because the NodeManager itself is healthy (its outputs are gone only due to the cleanup).
> {code}
> 2018-07-04 23:47:42,344 [INFO] [fetcher {Map_1} #10] |orderedgrouped.ShuffleScheduler|: Map_1: Reporting fetch failure for InputIdentifier: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=874], attemptNumber=0, pathComponent=ttempt_1530690553693_17267_4_01_000874_0_10003, spillType=0, spillId=-1] taskAttemptIdentifier: Map 1_000874_00 to AM.
> {code}
> Approximately 460 errors like this were reported back to the AM, and they keep being marked as "fetcher unhealthy", probably because the restarted NM showed up as healthy.
> This scenario of shuffle failures is not handled, since the NM shows up as healthy. The mapper (the source InputIdentifier) has to be marked as unhealthy and rerun; a rough sketch of both halves of that handling follows.
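> For illustration, a minimal sketch of the fetcher-side half: classifying an HTTP 401 from the shuffle port as a permanent, source-side failure rather than a transient one. The names below (FetchFailureKind, FetchFailureClassifier) are hypothetical, not existing Tez classes:
> {code}
> // Illustrative sketch only; these types are hypothetical, not Tez code.
> enum FetchFailureKind { TRANSIENT, SOURCE_LOST }
>
> class FetchFailureClassifier {
>     static FetchFailureKind classify(int httpStatus) {
>         // A 401 from the ShuffleHandler means the job token is unknown on
>         // that node (here, the NM local dirs were wiped on restart), so the
>         // map output is effectively gone and retrying cannot succeed.
>         if (httpStatus == 401) {
>             return FetchFailureKind.SOURCE_LOST;
>         }
>         // Timeouts, connection resets, etc. remain worth retrying.
>         return FetchFailureKind.TRANSIENT;
>     }
> }
> {code}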
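> And a sketch of the AM-side half: rerun the producer once enough distinct consumers report the same lost input, regardless of NM health. Again purely illustrative; SourceFailureTracker and the threshold are assumptions, not existing Tez code or configuration:
> {code}
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
>
> // Hypothetical sketch: count the distinct reducer attempts that failed to
> // fetch a given source attempt's output, and mark the source failed past a
> // threshold even though the NM looks healthy.
> class SourceFailureTracker {
>     private static final int MAX_DISTINCT_REPORTERS = 3; // assumed threshold
>     private final Map<String, Set<String>> reportersPerSource = new HashMap<>();
>
>     /** Returns true when the source map attempt should be marked failed and rerun. */
>     boolean reportFetchFailure(String sourceAttemptId, String reducerAttemptId) {
>         Set<String> reporters =
>             reportersPerSource.computeIfAbsent(sourceAttemptId, k -> new HashSet<>());
>         reporters.add(reducerAttemptId);
>         return reporters.size() >= MAX_DISTINCT_REPORTERS;
>     }
> }
> {code}
> Keying on distinct reducer attempts avoids rerunning a mapper because of a single flaky consumer; in this incident all four attempts of reducer 000496 failed on the same map output, which would cross such a threshold.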
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)