[
https://issues.apache.org/jira/browse/MAPREDUCE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated MAPREDUCE-5042:
----------------------------------
Attachment: MAPREDUCE-5042.patch
This is complicated by the fact that the job token currently serves a dual role:
it authenticates both the shuffle *and* the task umbilical. The former is
something that should persist across app attempts, while the latter should not.
We don't want old task attempts authenticating with the new app attempt, at
least not at this point, since that would only serve to confuse the new app
attempt.
Therefore I propose the following:
* The current job token remains essentially as-is for authenticating the task
umbilical, and each AM attempt continues to generate its own job token.
* A new secret key, the shuffle secret, will be generated by the job client at
job submission and stored in the job's credentials. Each app attempt will
extract the shuffle secret from the job's credentials and use it as the shared
secret to authenticate the shuffle.
Attaching the first draft of a patch that implements that proposal. It needs
unit tests, but I've manually tested that it can recover map tasks and
successfully shuffle their data.
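
As a rough illustration of the proposal above (this is not the attached patch),
the following self-contained Java sketch shows why a secret created once at
submit time keeps working across app attempts. Everything in it is an
assumption made for illustration: the class and method names, the
"shuffle.secret" alias, the plain Map standing in for the job's credentials,
and the choice of HmacSHA1 as the hash algorithm.

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Hadoop code base.
public class ShuffleSecretSketch {
    // Assumption: an HMAC-SHA1 keyed hash is used to authenticate fetches.
    static final String HMAC_ALGO = "HmacSHA1";
    // Hypothetical alias under which the secret is stored in the credentials.
    static final String SHUFFLE_SECRET_ALIAS = "shuffle.secret";

    // Job client side: generate the shuffle secret once, at job submission.
    static byte[] generateShuffleSecret() throws Exception {
        return KeyGenerator.getInstance(HMAC_ALGO).generateKey().getEncoded();
    }

    // Fetcher side: compute a keyed hash over the request being made.
    static byte[] sign(byte[] secret, String message) throws Exception {
        Mac mac = Mac.getInstance(HMAC_ALGO);
        mac.init(new SecretKeySpec(secret, HMAC_ALGO));
        return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
    }

    // Serving side: recompute the hash and compare in constant time.
    static boolean verify(byte[] secret, String message, byte[] hash) throws Exception {
        return MessageDigest.isEqual(sign(secret, message), hash);
    }

    public static void main(String[] args) throws Exception {
        // A plain map stands in for the job's credentials.
        Map<String, byte[]> jobCredentials = new HashMap<>();
        jobCredentials.put(SHUFFLE_SECRET_ALIAS, generateShuffleSecret());

        String request = "/mapOutput?job=job_1&reduce=0&map=attempt_1_m_000016_0";

        // App attempt 1 registers the secret with the shuffle service.
        byte[] registeredByAttempt1 = jobCredentials.get(SHUFFLE_SECRET_ALIAS);

        // App attempt 2 reads the same credentials, so a reducer it launches
        // produces a hash that the map output's server can still verify.
        byte[] usedByAttempt2 = jobCredentials.get(SHUFFLE_SECRET_ALIAS);
        byte[] hash = sign(usedByAttempt2, request);
        System.out.println("shared shuffle secret verifies: "
                + verify(registeredByAttempt1, request, hash));

        // By contrast, a secret regenerated per attempt (like the job token
        // today) does not match what the earlier attempt registered.
        byte[] perAttemptSecret = generateShuffleSecret();
        System.out.println("per-attempt secret verifies: "
                + verify(registeredByAttempt1, request, sign(perAttemptSecret, request)));
    }
}
```

The sketch deliberately leaves the job token out: under the proposal it would
still be generated per AM attempt and used only for the task umbilical.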
> Reducer unable to fetch for a map task that was recovered
> ---------------------------------------------------------
>
> Key: MAPREDUCE-5042
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5042
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am, security
> Affects Versions: 0.23.7, 2.0.4-beta
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Blocker
> Attachments: MAPREDUCE-5042.patch
>
>
> If an application attempt fails and is relaunched, the AM will try to recover
> previously completed tasks. If a reducer needs to fetch the output of a map
> task attempt that was recovered, then it will fail with a 401 error like this:
> {noformat}
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://xx:xx/mapOutput?job=job_1361569180491_21845&reduce=0&map=attempt_1361569180491_21845_m_000016_0
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:231)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:156)
> {noformat}
> Looking at the corresponding NM's logs, we see the shuffle failed due to
> "Verification of the hashReply failed".