[ https://issues.apache.org/jira/browse/TEZ-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538878#comment-16538878 ]

Jonathan Eagles commented on TEZ-3968:
--------------------------------------

[~Prabhu Joseph], a couple of things to clarify, and hopefully to help provide a 
fix in Tez (TEZ-3910, as [~kshukla] referred to).
- You have filed this JIRA with an affected version of Tez 0.7.1. This version 
was released May 10, 2016, and no further releases are planned against the 0.7 
line. If a fix is provided against 0.9, will that be sufficient?
- In addition, the error message in the NM log, _Can't find job token for job 
job_1530690553693_17267_, indicates that the NodeManager was restarted (either 
by design or in error) and that the job token wasn't found in the NodeManager 
state store. Do you have the NodeManager state store enabled for this cluster? 
We have also seen cases where the restart of the NM failed (due to a port 
conflict in my most recent case) and the NM shut down forcefully, deleting all 
job tokens; [~jlowe] can confirm this. To that end, it is recommended to keep 
ephemeral ports, such as the AM port range tez.am.client.am.port-range, out of 
the reserved ports and away from the NM web and shuffle ports, so they have a 
low likelihood of colliding with the NM ports (see the sketch below).
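As a rough illustration of the two settings in question, the sketch below reads 
the cluster configs and flags both of them. The property names 
(yarn.nodemanager.recovery.enabled, tez.am.client.am.port-range) are the 
standard YARN/Tez ones; the ports used (8042 NM web UI default, 13562 shuffle 
port from the NM log above) and the example range are assumptions for this 
cluster, not recommendations.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative check only: verifies that the NM state store is enabled and
// that the Tez AM client port range does not overlap the NM web/shuffle ports.
public class ShufflePortSanityCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("yarn-site.xml");   // assumes the site files are on the classpath
    conf.addResource("tez-site.xml");

    // NM state store: without recovery, job tokens are lost across an NM restart.
    boolean recovery = conf.getBoolean("yarn.nodemanager.recovery.enabled", false);
    System.out.println("yarn.nodemanager.recovery.enabled = " + recovery);

    // AM client port range, e.g. "32000-65000"; keep it clear of the NM ports.
    String range = conf.get("tez.am.client.am.port-range", "");
    System.out.println("tez.am.client.am.port-range = " + range);

    int[] nmPorts = {8042, 13562};   // NM web UI default and the shuffle port above
    if (range.contains("-")) {
      String[] bounds = range.split("-");
      int lo = Integer.parseInt(bounds[0].trim());
      int hi = Integer.parseInt(bounds[1].trim());
      for (int port : nmPorts) {
        if (port >= lo && port <= hi) {
          System.out.println("WARNING: " + range + " overlaps NM port " + port);
        }
      }
    }
  }
}
{code}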

If you can report back the reason for the loss of the NM state store tokens, not 
only can we provide a fix as part of TEZ-3910, but we can also help with your 
setup so your environment avoids needing the fix.

> Tez Job Fails with Shuffle failures too fast when NM returns a 401 error
> ------------------------------------------------------------------------
>
>                 Key: TEZ-3968
>                 URL: https://issues.apache.org/jira/browse/TEZ-3968
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.7.1
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> A Tez job failed because a reduce task failed on all four attempts while 
> fetching a particular map output from a node. The NodeManager where the map 
> task had succeeded was stopped, had its NM local directories cleared (as the 
> disks were full), and was started again. This caused the shuffle failure in 
> the NodeManager, as no job token was found.
> The NodeManager logs show the reason for the shuffle failure:
> {code}
> 2018-07-05 00:26:00,371 WARN  mapred.ShuffleHandler 
> (ShuffleHandler.java:messageReceived(947)) - Shuffle failure
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job 
> token for job job_1530690553693_17267 !!
>         at 
> org.apache.hadoop.mapreduce.security.token.JobTokenSecretManager.retrieveTokenSecret(JobTokenSecretManager.java:112)
>         at 
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:1133)
>         at 
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:944)
> {code}
> Analysis of Application Logs:
> Application application_1530690553693_17267 failed because task 
> task_1530690553693_17267_4_02_000496 failed on all four attempts.
> The four attempts:
> {code}
> attempt_1530690553693_17267_4_02_000496_3 -> 
> container_e270_1530690553693_17267_01_014554 -> bigdata2.openstacklocal
> attempt_1530690553693_17267_4_02_000496_2 -> 
> container_e270_1530690553693_17267_01_014423 -> bigdata3.openstacklocal
> attempt_1530690553693_17267_4_02_000496_1 -> 
> container_e270_1530690553693_17267_01_014311 -> bigdata4.openstacklocal
> attempt_1530690553693_17267_4_02_000496_0 -> 
> container_e270_1530690553693_17267_01_014613 -> bigdata5.openstacklocal
> {code}
> All four attempts failed while fetching the same map output:
> {code}
> 2018-07-05 00:26:54,161 [WARN] [fetcher {Map_1} #51] 
> |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after 
> connecting to bigdata6.openstacklocal:13562 with 1 inputs pending
> java.io.IOException: Server returned HTTP response code: 401 for URL: 
> http://bigdata6.openstacklocal:13562/mapOutput?job=job_1530690553693_17267&reduce=496&map=attempt_1530690553693_17267_4_01_000874_0_10003
> {code}
> The failures are reported back to the AM correctly in Tez, but the source is 
> not reported as unhealthy because the NodeManager itself is healthy (after 
> the cleanup and restart).
> {code}
> 2018-07-04 23:47:42,344 [INFO] [fetcher {Map_1} #10] 
> |orderedgrouped.ShuffleScheduler|: Map_1: Reporting fetch failure for 
> InputIdentifier: InputAttemptIdentifier [inputIdentifier=InputIdentifier 
> [inputIndex=874], attemptNumber=0, 
> pathComponent=ttempt_1530690553693_17267_4_01_000874_0_10003, spillType=0, 
> spillId=-1] taskAttemptIdentifier: Map 1_000874_00 to AM.
> {code}
> There are approximately 460 errors like this reported back to the AM, which 
> keep getting attributed as "fetcher unhealthy", probably because the 
> restarted NM showed up as healthy.
> This scenario of shuffle failures is not handled, as the NM showed up as 
> healthy. The mapper (the source InputIdentifier) has to be marked as 
> unhealthy and rerun.
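> A minimal sketch of the intended handling (hypothetical class and method 
> names, not Tez's actual ShuffleScheduler internals): count fetch failures per 
> source map attempt across the downstream attempts and, once a threshold is 
> crossed, declare the source attempt failed so the producer is rerun instead 
> of the reducer being blamed:
> {code}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical illustration of the behaviour requested above; this is not
> // Tez's ShuffleScheduler API, just the policy expressed as standalone code.
> public class SourceFailurePolicy {
>   private final Map<String, Integer> failuresPerSource = new HashMap<>();
>   private final int maxFailuresPerSource;
>
>   public SourceFailurePolicy(int maxFailuresPerSource) {
>     this.maxFailuresPerSource = maxFailuresPerSource;
>   }
>
>   // Called for every failed fetch of a map output (e.g. the HTTP 401 above).
>   // Returns true when the source attempt itself should be declared failed
>   // so the AM reruns the producer instead of failing the consumer task.
>   public boolean reportFetchFailure(String srcAttempt) {
>     int failures = failuresPerSource.merge(srcAttempt, 1, Integer::sum);
>     return failures >= maxFailuresPerSource;
>   }
>
>   public static void main(String[] args) {
>     SourceFailurePolicy policy = new SourceFailurePolicy(4);
>     String src = "attempt_1530690553693_17267_4_01_000874_0_10003";
>     for (int i = 1; i <= 4; i++) {
>       System.out.println("failure " + i + " -> rerun source: "
>           + policy.reportFetchFailure(src));
>     }
>   }
> }
> {code}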


