[
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096076#comment-13096076
]
Eli Collins commented on MAPREDUCE-2413:
----------------------------------------
Thanks for the update, Bharath. Could you share the functional tests that your
QA team wrote? Without them, how will other developers know whether they have
broken this feature?
In your experiments, does a machine with only a single functioning disk warrant
staying up? I suspect tasks on such a machine will perform poorly. I also
suspect that at Yahoo! you're using a configuration that blacklists a TT after
X disk failures; anyone who isn't using such a configuration will see their
cluster perform poorly.
Did you guys test both the default and Linux task controllers?
> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-2413
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: task-controller, tasktracker
> Affects Versions: 0.20.204.0
> Reporter: Bharath Mundlapudi
> Assignee: Ravi Gummadi
> Fix For: 0.20.204.0
>
> Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch,
> MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly, either at
> startup or at runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is
> on a bad disk. TaskTracker should ignore that particular mapred-local-dir,
> start up, and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, TaskTracker currently
> doesn't do anything special. This results in either
>   (a) TaskTracker continuing to try to use the bad disk, which causes lots
>   of task failures, possibly job failures (because multiple TTs have bad
>   disks), and eventually these TTs getting graylisted for all jobs. Recovery
>   needs a manual restart of the TT with a modified configuration of
>   mapred-local-dirs that avoids the bad disk; OR
>   (b) the health check script identifying the disk as bad and the TT getting
>   blacklisted. This also needs a manual restart of the TT with a modified
>   configuration of mapred-local-dirs that avoids the bad disk.
> This JIRA is to make TaskTracker more tolerant of disk failures by solving
> (1) and (2), i.e. TT should start as long as at least one of the
> mapred-local-dirs is on a good disk, and TT should adjust its in-memory list
> of mapred-local-dirs to avoid using bad mapred-local-dirs.
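
The behaviour asked for in (1) and (2) can be sketched roughly as below. This is
only an illustration and is not taken from the attached patches: the class
LocalDirsSketch, its methods, and the pluggable health check are all
hypothetical names. The sketch assumes that refusing to run is appropriate only
when no configured mapred-local-dir is usable.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; not the actual TaskTracker code. */
final class LocalDirsSketch {
    private final List<File> configured;      // every configured mapred-local-dir
    private final Predicate<File> isHealthy;  // pluggable disk check (assumption)
    private volatile List<File> good = Collections.emptyList();

    LocalDirsSketch(List<String> configuredDirs, Predicate<File> isHealthy) {
        List<File> dirs = new ArrayList<>();
        for (String d : configuredDirs) {
            dirs.add(new File(d));
        }
        this.configured = Collections.unmodifiableList(dirs);
        this.isHealthy = isHealthy;
        refresh(); // startup: skip bad dirs instead of refusing to come up
    }

    /** A simple default check: the directory exists (or can be created) and is rwx. */
    static boolean basicCheck(File dir) {
        return (dir.isDirectory() || dir.mkdirs())
                && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /**
     * Re-checks every configured dir and replaces the in-memory list of good
     * dirs. Throws if none is usable; the previous list is kept in that case.
     */
    synchronized List<File> refresh() {
        List<File> ok = new ArrayList<>();
        for (File d : configured) {
            if (isHealthy.test(d)) {
                ok.add(d);
            }
        }
        if (ok.isEmpty()) {
            throw new IllegalStateException("no usable mapred-local-dirs");
        }
        good = Collections.unmodifiableList(ok);
        return good;
    }

    /** The dirs that tasks should currently use. */
    List<File> goodDirs() {
        return good;
    }
}
```

A caller would construct this once at startup, for example with
LocalDirsSketch::basicCheck as the check, and call refresh() periodically at
runtime. Re-checking all configured dirs means a repaired disk would be picked
up again without a restart; whether that is desirable is a design choice this
sketch does not settle.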
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira