[
https://issues.apache.org/jira/browse/MAPREDUCE-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514319#comment-14514319
]
Craig Welch commented on MAPREDUCE-6251:
----------------------------------------
bq. It is not just getJob(). Clients may also make calls to isSuccessful() /
getCounters() etc and may run into the same failure. We should wrap them all up
in a generic retry..
Those are all calls that happen on the job object we are adding retries for
here, so they already get the benefit of the retry. In fact, everything of
this sort appears to go through this getJob path, so it is a natural central
point for the retry.
bq. Does it also make sense for the server to throw a specific exception when
this happens instead of retrying everytime?
I don't believe so - callers expect a response for valid queries, e.g., a job
which exists and has not been aged out of history. Changing the kind of error
they get won't help; they are expecting a response, and this change is to make
sure they get one in all but the most extraordinary circumstances (which is
not unreasonable). With this change they appear to effectively always get a
response; without it, there is a small but significant number of cases where
they do not.
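The retry being discussed can be sketched as a generic "retry while the lookup
returns null" helper. This is a minimal illustration of the idea, not the
actual JobClient code: the class and method names (RetryingJobFetcher,
retryOnNull) and the parameters are hypothetical.

```java
import java.util.concurrent.Callable;

public class RetryingJobFetcher {
    // Retry a lookup (e.g., a getJob-style call) that may transiently
    // return null when the history file is not yet visible on an
    // eventually-consistent DFS. Sleeps between attempts and returns
    // null only if every attempt comes back empty.
    public static <T> T retryOnNull(Callable<T> lookup, int maxAttempts,
                                    long sleepMillis) throws Exception {
        T result = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            result = lookup.call();
            if (result != null) {
                return result;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);
            }
        }
        return result; // still null after all attempts
    }
}
```

Because all of the status calls (isSuccessful(), getCounters(), etc.) operate
on the job object obtained through the single lookup point, retrying only that
lookup is enough to cover them.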
> JobClient needs additional retries at a higher level to address
> not-immediately-consistent dfs corner cases
> -----------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6251
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6251
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver, mrv2
> Affects Versions: 2.6.0
> Reporter: Craig Welch
> Assignee: Craig Welch
> Attachments: MAPREDUCE-6251.0.patch, MAPREDUCE-6251.1.patch,
> MAPREDUCE-6251.2.patch
>
>
> The JobClient is used to get job status information for running and completed
> jobs. Final state and history for a job are communicated from the application
> master to the job history server via a distributed file system: the history
> is uploaded by the application master to the dfs and then scanned/loaded by
> the job history server. While HDFS provides strong consistency guarantees,
> not all Hadoop-compatible distributed file systems do. When used in
> conjunction with a distributed file system which does not have this
> guarantee, there will be cases where the history server does not yet see an
> uploaded file, resulting in the dreaded "no such job" and a null value for
> the RunningJob in the client.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)