[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143873#comment-13143873
 ] 

Ravi Gummadi commented on MAPREDUCE-3121:
-----------------------------------------

Code level summary of the patch attached:

(1) NodeManager launches DiskHealthCheckerService, which periodically 
executes the disk-health-check code. A new configuration property, 
yarn.nodemanager.disk-health-checker.interval-ms, sets the interval at which 
this code is executed, with a default value of 120*1000 ms (i.e. 2 minutes).

(2) LocalStorage is a new class that manages a list of local file system 
directories and provides an API for checking the health of those directories. 
It is mostly similar to the TaskTracker.LocalStorage class of 0.20, except that 
this class's checkDirs() doesn't throw DiskErrorException when all directories 
fail; instead it returns true if a new disk failure is seen.

(3) DiskHealthCheckerService maintains 2 LocalStorage objects: one for 
nm-local-dirs and a second one for nm-log-dirs.

(4) ContainerExecutor is initialized with the DiskHealthCheckerService object. 
So both DefaultContainerExecutor.java and LinuxContainerExecutor.java get good 
nm-local-dirs and nm-log-dirs from the DiskHealthChecker always.

(5) The container-executor binary gets the good nm-local-dirs and good 
nm-log-dirs as parameters and uses only these good dirs. So these dirs are 
removed from the configuration file (i.e. they no longer need to be configured 
in container-executor.cfg).

(6) Whenever a new container gets launched, the good nm-local-dirs and good 
nm-log-dirs are updated in the configuration so that containers won't access 
bad disks. Everybody (localizer, webserver) goes through DiskHealthChecker to 
access nm-local-dirs and nm-log-dirs.

(7) On the NodeManager web UI, NodeHealthReport shows the list of good 
nm-local-dirs and nm-log-dirs in addition to true/false about the health 
of the node.

(8) A new unit test, TestDiskFailures, is added that makes disks (both 
nm-local-dirs and nm-log-dirs) fail and validates whether the 
NodeManager/DiskHealthChecker can identify these disk failures.
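
To make (1) and (2) concrete, here is a minimal sketch of the idea, not the 
code from the patch. The class name DiskCheckSketch, the helper isUsable(), and 
the use of java.util.Properties in place of Hadoop's Configuration are all 
assumptions made for illustration. It shows checkDirs() returning true only 
when a new failure is seen (and never throwing when every directory fails), 
and a periodic run driven by the interval property.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and bodies are assumptions, not the patch itself.
class DiskCheckSketch {
    static final String INTERVAL_KEY =
        "yarn.nodemanager.disk-health-checker.interval-ms";
    static final long DEFAULT_INTERVAL_MS = 120 * 1000L; // 2 minutes

    // Directories that passed the most recent check.
    private final List<String> goodDirs;

    DiskCheckSketch(List<String> dirs) {
        this.goodDirs = new ArrayList<>(dirs);
    }

    // Reads the check interval, falling back to the default when unset.
    // java.util.Properties stands in for Hadoop's Configuration here.
    static long intervalMs(Properties conf) {
        String value = conf.getProperty(INTERVAL_KEY);
        return value == null ? DEFAULT_INTERVAL_MS : Long.parseLong(value.trim());
    }

    // Returns a snapshot of the currently good directories.
    synchronized List<String> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    // Drops directories that are no longer usable. Returns true if at least
    // one new failure was seen; does not throw even if all directories fail.
    synchronized boolean checkDirs() {
        return goodDirs.removeIf(dir -> !isUsable(new File(dir)));
    }

    // Assumed health test: the directory exists and is fully accessible.
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead()
            && dir.canWrite() && dir.canExecute();
    }

    // Schedules checkDirs() at the configured interval. The caller is
    // responsible for shutting down the returned scheduler.
    ScheduledExecutorService start(Properties conf) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        long interval = intervalMs(conf);
        scheduler.scheduleWithFixedDelay(
            this::checkDirs, interval, interval, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Following (3), a service along these lines would hold two such objects, one 
for nm-local-dirs and one for nm-log-dirs, and callers would always ask it for 
getGoodDirs() rather than reading the configured lists directly.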

Tested the patch with (1) DefaultContainerExecutor and (2) 
LinuxContainerExecutor on my single-node cluster. The functionality seems to be 
working fine, with disk failures getting identified by NodeManager and the bad 
nm-local-dirs and bad nm-log-dirs getting avoided for new containers.
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
