[jira] [Commented] (MAPREDUCE-3121) NodeManager should handle disk-failures

Ravi Gummadi (Commented) (JIRA) Mon, 24 Oct 2011 20:59:03 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134735#comment-13134735 ]


Ravi Gummadi commented on MAPREDUCE-3121:
-----------------------------------------

Yes. I am adding a TimerTask that periodically checks nm-local-dirs and 
nm-log-dirs. This component maintains the list of nm-local-dirs and the list 
of nm-log-dirs, and all other code gets those directories from this component.
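
To make the idea concrete, here is a rough sketch of such a component. It is 
not the patch itself: the class name LocalDirsMonitor, its method names, and 
the "directory exists and is readable/writable/executable" test are all 
assumptions made for illustration. A directory that fails a check is dropped 
and not re-added, since disks coming back is left for later.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical sketch only; not the actual NodeManager code. */
final class LocalDirsMonitor {
    // Current good directories. Readers always see an immutable snapshot.
    private volatile List<String> goodLocalDirs;
    private volatile List<String> goodLogDirs;
    // Daemon timer so the checker never blocks JVM shutdown.
    private final Timer timer = new Timer("disk-health-checker", true);

    LocalDirsMonitor(List<String> localDirs, List<String> logDirs) {
        this.goodLocalDirs = new ArrayList<>(localDirs);
        this.goodLogDirs = new ArrayList<>(logDirs);
        checkDirs(); // initial pass before anyone asks for directories
    }

    /** Assumed health test: an existing directory with rwx access. */
    static boolean isUsable(String path) {
        File dir = new File(path);
        return dir.isDirectory()
            && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    private static List<String> filterUsable(List<String> dirs) {
        List<String> good = new ArrayList<>();
        for (String d : dirs) {
            if (isUsable(d)) {
                good.add(d);
            }
        }
        return Collections.unmodifiableList(good);
    }

    /** One check pass. Only still-good dirs are rechecked, so a failed
     *  directory stays excluded even if the disk comes back. */
    synchronized void checkDirs() {
        goodLocalDirs = filterUsable(goodLocalDirs);
        goodLogDirs = filterUsable(goodLogDirs);
    }

    /** Schedules the periodic check with a TimerTask. */
    void start(long intervalMs) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                checkDirs();
            }
        }, intervalMs, intervalMs);
    }

    void stop() {
        timer.cancel();
    }

    List<String> getGoodLocalDirs() { return goodLocalDirs; }

    List<String> getGoodLogDirs() { return goodLogDirs; }

    /** True while at least one local dir and one log dir remain usable. */
    boolean hasGoodDisks() {
        return !goodLocalDirs.isEmpty() && !goodLogDirs.isEmpty();
    }
}
```

Callers would construct this once with the configured directory lists, call 
start() with some check interval, and always ask getGoodLocalDirs() / 
getGoodLogDirs() instead of reading the configuration directly.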

There is very little dependency on the NM health checker script. The script 
should not return error(s) when disk failures are identified (especially when 
there are still some good disks). This behavior is similar to what exists in 
0.20 (i.e. MR1).

Support for disks coming back after a failure can be added as a later 
enhancement. This JIRA aims for behavior mostly similar to 0.20.

In this JIRA, the AM is treated like any other container and nothing special 
is done for it. We can think of enhancing this behavior later, if really 
needed.

Once RM is enhanced to consider disks in its allocation policy, we can extend 
this JIRA's work by propagating disk health information from NM to RM. So for 
now, I am not planning to change RM as part of this JIRA.

I am planning to add basic unit test(s) similar to those in the patch for 
MAPREDUCE-2850. Let us see if something better can be done.
                
> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
