NM disk failure detection only covers local dirs
-------------------------------------------------
Key: MAPREDUCE-3474
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3474
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: nodemanager, tasktracker
Affects Versions: 0.23.0, 0.20.205.0
Reporter: Eli Collins
This is the MR counterpart to HDFS-1848. Like HDFS volume failure detection, NM
disk failure detection checks only a subset of the disks and a subset of the
directories. E.g., the TT and the NM do not check the root disk for errors
unless a local dir resides on it. Even if a local dir resides on the root disk,
the disk checking code only checks the local dirs, so a failure that is only
seen when accessing a part of the disk not hosting the local dirs will go
unnoticed. The disk that hosts the logs, pid, tmp dirs, etc. is critical, so it
needs to be checked as well, and the NM should shut down if a critical disk is
not available (to prevent MR issues similar to HDFS-1848 and HDFS-2095). Aside
from ignoring the limitation, people currently work around it by using RAID-1
for the root disk or a health script that checks the root disk's health.
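
To make the proposal concrete, here is a rough sketch of what checking the
critical (non-local) dirs could look like. It is only an illustration under
stated assumptions: the class CriticalDirChecker and its methods are
hypothetical names, not existing NM or TT code, and the probe (create, write,
and delete a small file) is one possible check rather than the check the NM
performs on local dirs today.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the NodeManager code base. */
public class CriticalDirChecker {

    /**
     * Returns true if dir is an existing, readable, writable directory and a
     * small probe file can be created, written, and deleted inside it.
     */
    public static boolean isUsable(Path dir) {
        if (!Files.isDirectory(dir) || !Files.isReadable(dir) || !Files.isWritable(dir)) {
            return false;
        }
        try {
            Path probe = Files.createTempFile(dir, "disk-check", ".tmp");
            Files.write(probe, new byte[] {1});
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            // Any I/O error on the probe is treated as a failed directory.
            return false;
        }
    }

    /** Returns the subset of dirs that failed the usability probe. */
    public static List<Path> failedDirs(List<Path> dirs) {
        List<Path> failed = new ArrayList<>();
        for (Path dir : dirs) {
            if (!isUsable(dir)) {
                failed.add(dir);
            }
        }
        return failed;
    }

    /** Takes the critical dirs (e.g. log, pid, tmp dirs) as arguments. */
    public static void main(String[] args) {
        List<Path> critical = new ArrayList<>();
        for (String arg : args) {
            critical.add(Paths.get(arg));
        }
        List<Path> failed = failedDirs(critical);
        if (!failed.isEmpty()) {
            // Assumption: a failed critical dir is fatal, as proposed above.
            System.err.println("Critical dirs failed: " + failed);
            System.exit(1);
        }
        System.out.println("All critical dirs usable");
    }
}
```

Run standalone, something like this could also serve as the root-disk health
script mentioned above. Inside the NM the same probe would presumably run
periodically alongside the existing local dir checks.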