[ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134735#comment-13134735 ]
Ravi Gummadi commented on MAPREDUCE-3121:
-----------------------------------------

Yes. I am adding a TimerTask that periodically checks nm-local-dirs and nm-log-dirs. This component maintains the list of nm-local-dirs and the list of nm-log-dirs, and everybody accesses nm-local-dirs and nm-log-dirs through this component.

There is very little dependency on the NM health-checker script. The script should not return errors when disk failures are identified (especially when some good disks remain). This behavior is similar to what is in 0.20 (i.e. MR1).

Supporting disks that come back after a failure can be added as a later enhancement. This JIRA mostly brings the NM in line with the 0.20 behavior.

As part of this JIRA, the AM is treated like any other container; nothing AM-specific is done here. We can enhance that behavior later if it is really needed.

Once the RM is enhanced to consider disks in its allocation policy, we can extend this JIRA's work by propagating disk-health information from the NM to the RM. So for now, I am not planning to change the RM as part of this JIRA.

I am planning to add basic unit test(s) similar to the patch for MAPREDUCE-2850. Let us see if something better can be done.

> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to
> minimize the impact of transient/permanent disk failures on containers. With
> larger number of disks per node, the ability to continue to run containers on
> other disks is crucial.
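A minimal Java sketch of the disk-checking component described in the comment: a `java.util.TimerTask` that periodically re-checks nm-local-dirs and nm-log-dirs and holds the lists that the rest of the NM reads from. The class name `DiskHealthMonitor`, the `hasGoodDirs()` policy, and the "readable, writable directory in which a probe file can be created" test are hypothetical assumptions for illustration, not the actual NodeManager code.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

/**
 * Hypothetical sketch, not the actual NodeManager class: periodically checks
 * nm-local-dirs and nm-log-dirs and keeps the lists of dirs that still work.
 */
class DiskHealthMonitor {
    // Immutable snapshots, replaced wholesale on each pass so readers never
    // observe a partially updated list.
    private volatile List<String> goodLocalDirs;
    private volatile List<String> goodLogDirs;
    private Timer timer;

    DiskHealthMonitor(List<String> localDirs, List<String> logDirs) {
        // Until the first check runs, every configured dir is assumed good.
        this.goodLocalDirs = Collections.unmodifiableList(new ArrayList<>(localDirs));
        this.goodLogDirs = Collections.unmodifiableList(new ArrayList<>(logDirs));
    }

    /** Schedules checkDirs() on a daemon TimerTask every intervalMs. */
    synchronized void start(long intervalMs) {
        if (timer != null) {
            return; // already running
        }
        timer = new Timer("disk-health-checker", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                checkDirs();
            }
        }, 0L, intervalMs);
    }

    synchronized void stop() {
        if (timer != null) {
            timer.cancel();
            timer = null;
        }
    }

    /**
     * One check pass. Only dirs that are currently good are re-checked, so a
     * failed disk stays excluded (re-admitting recovered disks is deferred).
     */
    synchronized void checkDirs() {
        goodLocalDirs = filterGood(goodLocalDirs);
        goodLogDirs = filterGood(goodLogDirs);
    }

    List<String> getGoodLocalDirs() { return goodLocalDirs; }

    List<String> getGoodLogDirs() { return goodLogDirs; }

    /** Assumed policy: the node stays usable while each list has a good dir. */
    boolean hasGoodDirs() {
        return !goodLocalDirs.isEmpty() && !goodLogDirs.isEmpty();
    }

    private static List<String> filterGood(List<String> dirs) {
        List<String> good = new ArrayList<>();
        for (String d : dirs) {
            if (isUsable(new File(d))) {
                good.add(d);
            }
        }
        return Collections.unmodifiableList(good);
    }

    private static boolean isUsable(File dir) {
        if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite() || !dir.canExecute()) {
            return false;
        }
        // Permission bits can look fine on a failed or read-only disk, so also
        // try to create and delete a small probe file.
        try {
            File probe = File.createTempFile("disk-check", ".tmp", dir);
            return probe.delete();
        } catch (IOException | SecurityException e) {
            return false;
        }
    }
}
```

Callers always get an immutable snapshot that the timer thread swaps in on each pass, so containers can keep being placed on the remaining good disks while failed ones drop out, without involving the health-checker script. Because only currently-good dirs are re-checked, a disk that recovers is not re-admitted, which matches the comment's choice to leave that as a later enhancement.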