[
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095774#comment-13095774
]
Bharath Mundlapudi commented on MAPREDUCE-2413:
-----------------------------------------------
Hi Eli,
>> What testing was done with this change before it was committed?
A tremendous amount of testing went into these patches. We have tested
this feature at many levels.
Here is what we tested:
1. Simulating disk failures.
2. Randomly making disks read-only via mounting.
3. Randomly making directories read-only or write-only.
4. Our QA team has written more functional tests.
5. There was lots of manual verification of this feature.
6. We ran Terasort and Gridmixv3 to verify behavior under disk
failures.
A huge effort went into this feature, including many man-hours of
testing.
>> TT should start even if at least one of the mapred-local-dirs is on a good
>> disk
Having a configurable option for this might be a good idea. But the rationale
for this decision is that something is better than nothing. If we have one
disk on which to run the TT, why not utilize the compute capacity of this
machine? A certain percentage of our cluster runs CPU-intensive jobs too.
Let me know if you need any further explanation.
> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-2413
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: task-controller, tasktracker
> Affects Versions: 0.20.204.0
> Reporter: Bharath Mundlapudi
> Assignee: Ravi Gummadi
> Fix For: 0.20.204.0
>
> Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch,
> MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly both at startup
> and runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is
> on a bad disk. TaskTracker should ignore that particular mapred-local-dir and
> start up and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, currently TaskTracker
> doesn't do anything special. This results in either
> (a) TaskTracker continues to "try to use that bad disk", which results
> in lots of task failures and possibly job failures (because of multiple TTs
> having bad disks), and eventually these TTs get graylisted for all jobs.
> This needs a manual restart of the TT with a modified configuration of
> mapred-local-dirs that avoids the bad disk. OR
> (b) Health check script identifying the disk as bad and the TT gets
> blacklisted. And this also needs manual restart of TT with modified
> configuration of mapred-local-dirs avoiding the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures solving
> (1) and (2). i.e. TT should start even if at least one of the
> mapred-local-dirs is on a good disk and TT should adjust its in-memory list
> of mapred-local-dirs and avoid using bad mapred-local-dirs.
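
To illustrate the behavior described in (1) and the in-memory adjustment of the directory list, here is a minimal Java sketch. It is not the code from the MR-2413 patches: the class name LocalDirFilter and its methods are hypothetical, and the check uses only plain java.io.File permission tests, which may not catch every kind of disk failure.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hadoop or the MR-2413 patches. */
final class LocalDirFilter {
    private LocalDirFilter() {}

    /** Assumption: a local dir counts as usable if it exists and can be read, written and traversed. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /**
     * Returns the usable subset of the configured local dirs, in configured order.
     * Throws IOException only when no configured dir is usable, so a single
     * good disk is enough to keep going.
     */
    static List<String> selectGoodDirs(List<String> configuredDirs) throws IOException {
        List<String> good = new ArrayList<String>();
        for (String path : configuredDirs) {
            if (isUsable(new File(path))) {
                good.add(path);
            }
        }
        if (good.isEmpty()) {
            throw new IOException("No usable local directories among " + configuredDirs);
        }
        return good;
    }
}
```

Calling selectGoodDirs once at startup would cover case (1). Calling it again periodically and replacing the in-memory list with the result is one possible way to approach case (2), under the same assumptions.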