[jira] [Commented] (MAPREDUCE-2413) TaskTracker should handle disk failures at both startup and runtime

Eli Collins (JIRA) Wed, 31 Aug 2011 18:53:38 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095073#comment-13095073
 ]


Eli Collins commented on MAPREDUCE-2413:
----------------------------------------

Here's review feedback on the patch that was committed:

* LocalStorage should not be public. Adding a method to UtilsForTests would 
allow it to be package-private.
* This is a larger issue, but LocalStorage doesn't need to be tied to MR (see 
HADOOP-7551).
* getBadLocalDirs and the array of bad dirs are dead code and should be 
removed.
* TT#getLocalStorage is dead code too.
* getGoodLocalDirsString should not reimplement StringUtils#join. A better name 
would be getDirs: we know it returns local dirs, and it should only return 
good dirs, i.e. all callers should use it to get the list of local dirs to 
allocate from, without having to care whether they're good or bad.
* The LocalStorage#isDiskFailed method is goofy. It would be cleaner if it 
just returned the number of valid directories; the code below it would then 
return STALE if the number of good dirs has changed since the last check.
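
To make the last two points concrete, here is a rough sketch of the shape I 
have in mind. It is not the committed code: LocalDirsSketch, checkDirs, 
isStale and isUsable are hypothetical names, String.join stands in for 
StringUtils#join so the example compiles without Hadoop on the classpath, and 
a boolean stands in for the STALE state.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the LocalStorage class from the patch. */
class LocalDirsSketch {
  // Directories currently believed to be good. Bad ones are simply dropped,
  // so there is no separate bad-dirs array to maintain.
  private final List<String> dirs;
  // Number of good directories seen at the previous check.
  private int lastGoodCount;

  LocalDirsSketch(List<String> configuredDirs) {
    this.dirs = new ArrayList<>(configuredDirs);
    this.lastGoodCount = dirs.size();
  }

  /** Re-checks every directory, drops failed ones, returns the good count. */
  synchronized int checkDirs() {
    dirs.removeIf(d -> !isUsable(new File(d)));
    return dirs.size();
  }

  /** Returns only good dirs, comma-separated, via a library join. */
  synchronized String getDirs() {
    return String.join(",", dirs);
  }

  /** True if the good-dir count changed since the last check (the STALE case). */
  synchronized boolean isStale() {
    int good = checkDirs();
    boolean changed = good != lastGoodCount;
    lastGoodCount = good;
    return changed;
  }

  /** Assumed health test: an existing, readable, writable, searchable directory. */
  static boolean isUsable(File dir) {
    return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
  }
}
```

Callers would only ever ask getDirs for directories to allocate from, and the 
health-reporting path would call isStale instead of asking whether a disk has 
failed.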

> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch, 
> MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly at either 
> startup or runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is 
> on a bad disk. TaskTracker should ignore that particular mapred-local-dir, 
> start up, and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, TaskTracker currently 
> doesn't do anything special. This results in either
>    (a) TaskTracker continuing to try to use the bad disk, which causes lots 
> of task failures and possibly job failures (because multiple TTs have bad 
> disks), and eventually these TTs getting graylisted for all jobs. This 
> requires a manual restart of the TT with a modified mapred-local-dirs 
> configuration that avoids the bad disk, OR
>    (b) the health check script identifying the disk as bad and the TT getting 
> blacklisted. This also requires a manual restart of the TT with a modified 
> mapred-local-dirs configuration that avoids the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures by 
> solving (1) and (2), i.e. the TT should start as long as at least one of the 
> mapred-local-dirs is on a good disk, and it should adjust its in-memory list 
> of mapred-local-dirs to avoid using bad mapred-local-dirs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
