Craig Welch created MAPREDUCE-6252:
--------------------------------------
Summary: JobHistoryServer should not fail when encountering a missing directory
Key: MAPREDUCE-6252
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6252
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Craig Welch
The JobHistoryServer maintains a cache that maps job serial number parts to dfs
paths, which it uses when seeking a job that is no longer in its memory cache.
A given serial number may map to multiple directories, differentiated by
timestamp. At present the JobHistoryServer fails any time it attempts to find a
job in a directory that is in that cache but no longer exists, even though the
job may well exist in a different directory for the same serial number.
Typically this is not an issue, but the history cleanup process removes the
directory from dfs before removing it from the cache. This leaves a window of
time in which a directory present in the cache is missing from dfs, resulting
in failure. For some dfs implementations it appears that the top-level
directory may become unavailable some time before the full deletion of the tree
completes, which extends what might otherwise be a brief period of failure.
Further, this places the service at the mercy of outside processes that might
remove those directories. The proposal is simply to make the server resilient
to this state, so that encountering a missing directory is not fatal and the
search continues in the remaining directories.
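
A minimal sketch of the proposed behaviour, for illustration only. It is not
the JobHistoryServer code: it uses java.nio.file on a local file system rather
than the Hadoop FileSystem API, and the class name TolerantHistoryLookup, the
method findJobFile, and the assumption that a history file name starts with
the job id are all hypothetical. The point it shows is that a cached directory
which has disappeared is skipped rather than treated as a fatal error.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the lookup; not part of Hadoop.
class TolerantHistoryLookup {

    /**
     * Searches each cached directory in order for a file whose name starts
     * with jobId (an assumed naming convention). Directories that no longer
     * exist are skipped so the search can continue elsewhere.
     */
    static Optional<Path> findJobFile(List<Path> cachedDirs, String jobId)
            throws IOException {
        for (Path dir : cachedDirs) {
            try (DirectoryStream<Path> entries =
                     Files.newDirectoryStream(dir, jobId + "*")) {
                for (Path entry : entries) {
                    return Optional.of(entry); // first match wins
                }
            } catch (NoSuchFileException e) {
                // Stale cache entry: the directory was removed (for example
                // by cleanup) after it was cached. Not fatal; try the next.
            }
        }
        return Optional.empty(); // not found in any surviving directory
    }
}
```

Only the missing-directory case is swallowed here; other I/O errors still
propagate, since they may indicate a real problem rather than a stale cache
entry.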