[jira] [Updated] (MAPREDUCE-5042) Reducer unable to fetch for a map task that was recovered

Jason Lowe (JIRA) Tue, 05 Mar 2013 17:58:18 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe updated MAPREDUCE-5042:
----------------------------------

    Attachment: MAPREDUCE-5042.patch

This is complicated by the fact that the job token currently serves a dual 
role: it authenticates both the shuffle *and* the task umbilical.  The former 
should persist across app attempts, while the latter should not.  We don't 
want old task attempts authenticating with the new app attempt, at least not 
at this point, since that would only confuse the new app attempt.

Therefore I propose the following:

* The current job token remains largely as-is for authenticating the task 
umbilical, and each AM attempt continues to generate its own job token.
* A new secret key, the shuffle secret, will be generated by the job client 
at job submission and stored in the job's credentials.  Each app attempt 
will extract the shuffle secret from the job's credentials and use it as the 
shared secret to authenticate the shuffle.
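
To illustrate why a per-job secret fixes the fetch while per-attempt job tokens 
do not, here is a minimal, self-contained sketch.  It is not the code in the 
attached patch.  The class and method names (ShuffleSecretSketch, newSecret, 
sign, verify), the HmacSHA1 algorithm, the 20-byte key length, and the example 
URL are all assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ShuffleSecretSketch {
    // Assumption: HMAC-SHA1 stands in for whatever MAC the real shuffle uses.
    private static final String MAC_ALGORITHM = "HmacSHA1";

    /** Hypothetical helper: generates a fresh random secret key. */
    public static byte[] newSecret() {
        byte[] key = new byte[20];
        new SecureRandom().nextBytes(key);
        return key;
    }

    /** Signs a fetch URL with the given secret and returns a Base64 hash. */
    public static String sign(byte[] secret, String url) {
        try {
            Mac mac = Mac.getInstance(MAC_ALGORITHM);
            mac.init(new SecretKeySpec(secret, MAC_ALGORITHM));
            byte[] digest = mac.doFinal(url.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Recomputes the hash with the verifier's secret and compares in constant time. */
    public static boolean verify(byte[] secret, String url, String hash) {
        byte[] expected = sign(secret, url).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, hash.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Hypothetical fetch URL, shaped like the one in the issue description.
        String url = "/mapOutput?job=job_1&reduce=0&map=attempt_1_m_000016_0";

        // Each AM attempt generates its own job token.
        byte[] jobTokenAttempt1 = newSecret();
        byte[] jobTokenAttempt2 = newSecret();
        // The job client generates the shuffle secret once, at submission.
        byte[] shuffleSecret = newSecret();

        // The two sides of the shuffle hold job tokens from different attempts.
        boolean withJobTokens =
            verify(jobTokenAttempt1, url, sign(jobTokenAttempt2, url));
        // Both sides hold the same shuffle secret regardless of the attempt.
        boolean withShuffleSecret =
            verify(shuffleSecret, url, sign(shuffleSecret, url));

        System.out.println("verified with per-attempt job tokens: " + withJobTokens);
        System.out.println("verified with per-job shuffle secret: " + withShuffleSecret);
    }
}
```

In this sketch the only thing that changes between the failing and the passing 
case is the lifetime of the key: per app attempt versus per job.  The task 
umbilical can keep using the per-attempt job token, so old task attempts still 
cannot authenticate with a new app attempt.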

Attaching the first draft of a patch that implements that proposal.  It needs 
unit tests, but I've manually tested that it can recover map tasks and 
successfully shuffle their data.
                
> Reducer unable to fetch for a map task that was recovered
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5042
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5042
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, security
>    Affects Versions: 0.23.7, 2.0.4-beta
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-5042.patch
>
>
> If an application attempt fails and is relaunched, the AM will try to recover 
> previously completed tasks.  If a reducer needs to fetch the output of a map 
> task attempt that was recovered, the fetch will fail with a 401 error like this:
> {noformat}
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://xx:xx/mapOutput?job=job_1361569180491_21845&reduce=0&map=attempt_1361569180491_21845_m_000016_0
>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:231)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:156)
> {noformat}
> Looking at the corresponding NM's logs, we see the shuffle failed due to 
> "Verification of the hashReply failed".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
