[
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101623#comment-13101623
]
Owen O'Malley commented on MAPREDUCE-2413:
------------------------------------------
Eli,
It isn't unreasonable to have a TT without a DN, or the other way around. I
agree that we should make symmetric config knobs so that if someone has them
tuned differently, they did it explicitly. (In reality, I think
failed.volumes.tolerated is a mistake, and we need to move to a list of required
partitions, with everything else optional. Even a node with a single good
drive can do useful work, and getting it to do something would be good.
Although we should also scale down the number of tasks/containers scheduled on
such a node...)
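
A minimal sketch of the "required partitions" idea above, in plain Java. None
of this is Hadoop code: the class name, method names, and the proportional
scaling rule are assumptions made for illustration. A node may start only if
every required partition is healthy, and its slot count shrinks with the
fraction of volumes that are still good.

```java
import java.util.Set;

public class VolumePolicy {
    /**
     * Hypothetical startup rule: the node may start only if every
     * required partition is in the healthy set. Optional partitions
     * are not consulted at all.
     */
    static boolean canStart(Set<String> required, Set<String> healthy) {
        return healthy.containsAll(required);
    }

    /**
     * Hypothetical scaling rule: reduce the configured slot count in
     * proportion to the healthy volumes, but keep at least one slot
     * while any volume is healthy, so a single good drive still does work.
     */
    static int scaledSlots(int configuredSlots, int totalVolumes, int healthyVolumes) {
        if (configuredSlots <= 0 || totalVolumes <= 0 || healthyVolumes <= 0) {
            return 0;
        }
        int scaled = configuredSlots * healthyVolumes / totalVolumes; // integer floor
        return Math.max(1, Math.min(configuredSlots, scaled));
    }
}
```

With these assumed rules, a node configured for 8 slots over 4 volumes would
be offered 2 slots when only one volume is healthy.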
> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-2413
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: task-controller, tasktracker
> Affects Versions: 0.20.204.0
> Reporter: Bharath Mundlapudi
> Assignee: Ravi Gummadi
> Fix For: 0.20.204.0
>
> Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch,
> MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly, either at
> startup or at runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is
> on a bad disk. TaskTracker should ignore that particular mapred-local-dir and
> start up and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, currently TaskTracker
> doesn't do anything special. This results in either
> (a) TaskTracker continuing to "try to use that bad disk", which results
> in lots of task failures and possibly job failures (because of multiple TTs
> having bad disks), and eventually these TTs getting graylisted for all jobs.
> This needs a manual restart of the TT with a modified configuration of
> mapred-local-dirs that avoids the bad disk. OR
> (b) The health check script identifying the disk as bad and the TT getting
> blacklisted. This also needs a manual restart of the TT with a modified
> configuration of mapred-local-dirs that avoids the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures by
> solving (1) and (2). That is, the TT should start as long as at least one of
> the mapred-local-dirs is on a good disk, and the TT should adjust its
> in-memory list of mapred-local-dirs to avoid using bad mapred-local-dirs.
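
The behavior requested in (1) and (2) above can be sketched roughly as
follows. This is not the attached patch and not TaskTracker code: the class
and method names are hypothetical, and the health test uses plain java.io
checks as a stand-in for whatever disk check the real fix applies.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LocalDirChecker {
    /** Assumed health test: the path is a directory we can read, write, and list. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /** Returns the configured dirs that currently pass the health test, in order. */
    static List<String> goodDirs(List<String> configured) {
        List<String> good = new ArrayList<>();
        for (String path : configured) {
            if (isUsable(new File(path))) {
                good.add(path);
            }
        }
        return good;
    }

    /**
     * Startup rule from (1): skip bad dirs and start with the rest,
     * failing only when no configured dir is usable.
     */
    static List<String> startupDirs(List<String> configured) {
        List<String> good = goodDirs(configured);
        if (good.isEmpty()) {
            throw new IllegalStateException("no usable mapred-local-dirs");
        }
        return good;
    }
}
```

For the runtime case in (2), the same goodDirs() check could be re-run
periodically and the in-memory list replaced with its result, so that a disk
that fails later drops out without a restart.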
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira