[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134413#comment-13134413
 ] 

Eli Collins commented on MAPREDUCE-3121:
----------------------------------------

Will failure detection work similarly to MR1, i.e. there's a periodic local 
dir checker thread that marks directories as failed? In MR1, failed attempts by 
a container to access local dirs do not necessarily result in the local dir 
being marked as failed, since e.g. a failure that causes a write to a log to 
fail may not cause the dir checking to fail. If the key places that frequently 
access local disk - logging and writing intermediate data - handle disk 
failure, then the container can fail fast and we may identify failures that 
would otherwise go unnoticed.
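
To make the two detection paths concrete, here is a rough sketch of what I 
have in mind. All names (LocalDirTracker, checkAll, reportFailure) are made up 
for illustration and are not existing NM or MR1 classes. It assumes a failed 
dir stays failed and that a simple read/write/execute probe is a good enough 
periodic check.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; not the NodeManager's actual implementation.
class LocalDirTracker {
    private final List<File> allDirs;
    // Thread-safe set: updated by the checker thread and by I/O paths.
    private final Set<File> failedDirs = ConcurrentHashMap.newKeySet();

    LocalDirTracker(List<File> dirs) {
        this.allDirs = new ArrayList<>(dirs);
    }

    // Periodic path: probe every dir that has not already been marked failed.
    void checkAll() {
        for (File dir : allDirs) {
            if (!failedDirs.contains(dir) && !isUsable(dir)) {
                failedDirs.add(dir);
            }
        }
    }

    // Fail-fast path: logging or intermediate-data writers call this when an
    // I/O error occurs, so the dir is marked without waiting for the checker.
    void reportFailure(File dir) {
        if (allDirs.contains(dir)) {
            failedDirs.add(dir);
        }
    }

    // Dirs still considered healthy, in their configured order.
    List<File> goodDirs() {
        List<File> good = new ArrayList<>();
        for (File dir : allDirs) {
            if (!failedDirs.contains(dir)) {
                good.add(dir);
            }
        }
        return good;
    }

    // Assumed probe: the dir exists and is readable, writable and searchable.
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead()
            && dir.canWrite() && dir.canExecute();
    }

    // Runs checkAll() on a daemon thread every intervalMs milliseconds.
    ScheduledExecutorService startPeriodicCheck(long intervalMs) {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "local-dir-checker");
                t.setDaemon(true);
                return t;
            });
        ses.scheduleWithFixedDelay(this::checkAll, intervalMs, intervalMs,
            TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

The point is that reportFailure() lets a failed log or intermediate-data write 
mark the dir immediately, even when the periodic probe would still pass.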

Will there be any dependency on the NM health checker script?  

Can disk failures be considered transient? Do we want to support disks coming 
back online? In MR1 a disk failure means the disk is blacklisted until TT 
restart. 

Is the AM treated like a container as well? I.e. it's allowed to run and fail, 
and the NM will restart it?
 
IIUC the RM doesn't currently consider disks in its allocation policy, 
therefore the "major percentage" that's needed to offline the NM should take 
this into account, right? This is similar to the issue in MR1 where we don't 
throttle slots based on the number of available disks.

What's the plan for testing? Adding some basic fault injection (either manually 
as in the NN or via a framework) to the commonly used paths would make testing 
easier than what we do in MR1. 
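
As a rough illustration of the manual style of fault injection, something 
like the following would do. DiskFaultInjector and IntermediateDataWriter are 
hypothetical names for this sketch, not existing Hadoop classes, and the 
actual write is elided.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical test hook: tests flip the flag to simulate a failed disk.
class DiskFaultInjector {
    private static final AtomicBoolean failWrites = new AtomicBoolean(false);

    static void setFailWrites(boolean fail) {
        failWrites.set(fail);
    }

    // Called at the top of a disk write path; throws when a fault is armed.
    static void maybeFailWrite(String path) throws IOException {
        if (failWrites.get()) {
            throw new IOException("Injected disk failure writing " + path);
        }
    }
}

// Hypothetical write path that consults the injector before touching disk.
class IntermediateDataWriter {
    // Returns true on success, false when the caller should fail fast.
    static boolean write(String path, byte[] data) {
        try {
            DiskFaultInjector.maybeFailWrite(path);
            // The real write of data to path would go here.
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A test would call DiskFaultInjector.setFailWrites(true), run the write path, 
and check that the failure is surfaced rather than swallowed.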
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
