[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated MAPREDUCE-6252:
-----------------------------------
    Description: The JobHistoryServer maintains a cache that maps job serial 
number parts to dfs paths, which it uses when seeking a job it no longer has in 
its memory cache.  A given serial number may map to multiple directories, 
differentiated by time stamp.  At present the JobHistoryServer fails any time 
it attempts to find a job in a cached directory which no longer exists, even 
though the job may well exist in a different directory for the same serial 
number.  Typically this is not an issue, but the history cleanup process 
removes the directory from dfs before removing it from the cache.  This leaves 
a window of time in which a directory present in the cache may be missing from 
dfs, resulting in failure.  On some dfs implementations it appears that the top 
level directory may become unavailable some time before the full deletion of 
the tree completes, which extends what might otherwise be a brief period of 
failure.  Further, this places the service at the mercy of outside processes 
which might remove those directories.  The proposal is simply to make the 
server resistant to this state, so that encountering a missing directory is 
not fatal and the lookup continues on to seek the job elsewhere.  
(was: The JobHistoryServer maintains a cache of job serial number parts to dfs 
paths which it uses when seeking a job it no longer has in it's memory cache, 
multiple directories for a given serial number differentiated by time stamp.  
At present the jobhistory server will fail any time it attempts to find a job 
in a directory which no longer exists based on that cache - even though the job 
may well exist in a different directory for the serial number.  Typically this 
is not an issue, but the history cleanup process removes the directory from dfs 
before removing it from the cache which leaves a window of time where a 
directory may be missing from dfs which is present in the cache, resulting in 
failure.  For some dfs's it appears that the top level directory may become 
unavailable some time before the full deletion of the tree completes which 
extends what might otherwise be a brief period of failure to a more extended 
period.  Further, this also places the service at the mercy of outside 
processes which might remove those directories.  The proposal is simply to make 
the server resistant to this state such that encountering this missing 
directory is not fatal and the process will continue on to seek it elsewhere.)

> JobHistoryServer should not fail when encountering a missing directory
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.6.0
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>         Attachments: MAPREDUCE-6252.0.patch
>
>
> The JobHistoryServer maintains a cache that maps job serial number parts to 
> dfs paths, which it uses when seeking a job it no longer has in its memory 
> cache.  A given serial number may map to multiple directories, 
> differentiated by time stamp.  At present the JobHistoryServer fails any time 
> it attempts to find a job in a cached directory which no longer exists, even 
> though the job may well exist in a different directory for the same serial 
> number.  Typically this is not an issue, but the history cleanup process 
> removes the directory from dfs before removing it from the cache.  This 
> leaves a window of time in which a directory present in the cache may be 
> missing from dfs, resulting in failure.  On some dfs implementations it 
> appears that the top level directory may become unavailable some time before 
> the full deletion of the tree completes, which extends what might otherwise 
> be a brief period of failure.  Further, this places the service at the mercy 
> of outside processes which might remove those directories.  The proposal is 
> simply to make the server resistant to this state, so that encountering a 
> missing directory is not fatal and the lookup continues on to seek the job 
> elsewhere.
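
For illustration only, the proposed behaviour might be sketched roughly as 
below.  This is not the attached patch and not the JobHistoryServer code: the 
class SerialDirLookup, the method findJobFile and the map-based index are all 
hypothetical names, and java.nio.file stands in for the dfs API so that the 
sketch is self-contained.  The point it shows is that a cached directory which 
has since vanished is skipped, and the search moves on to the next directory 
cached for the same serial number.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; the local file system stands in for the dfs.
final class SerialDirLookup {

    private SerialDirLookup() {
    }

    /**
     * Searches every directory cached for the given serial number for a file
     * named jobFileName. A cached directory that no longer exists is treated
     * as a stale cache entry and skipped rather than failing the whole lookup.
     */
    static Optional<Path> findJobFile(Map<String, ? extends Collection<Path>> index,
                                      String serial,
                                      String jobFileName) throws IOException {
        Collection<Path> dirs = index.get(serial);
        if (dirs == null) {
            return Optional.empty(); // nothing cached for this serial number
        }
        for (Path dir : dirs) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (entry.getFileName().toString().equals(jobFileName)) {
                        return Optional.of(entry);
                    }
                }
            } catch (NoSuchFileException e) {
                // Directory was removed (by cleanup or an outside process)
                // before the cache caught up; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

Only the "directory is missing" case is swallowed here; any other IOException 
still propagates, so genuine dfs errors are not hidden by the change.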



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
