[jira] [Updated] (MAPREDUCE-2413) TaskTracker should handle disk failures at both startup and runtime

Ravi Gummadi (JIRA) Thu, 31 Mar 2011 13:53:45 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ravi Gummadi updated MAPREDUCE-2413:
------------------------------------

    Attachment: MR-2413.v0.patch

Attaching patch solving the 2 issues mentioned in the JIRA description.

The patch does the following:

(1) TaskTracker maintains a list of good mapred-local-dirs and a list of bad 
mapred-local-dirs.
(2) When TT starts up, each mapred-local-dir is checked to see whether it is 
on a good disk. This updates the good dirs list and the bad dirs list.
(3) TaskTracker periodically checks the health of the good mapred-local-dirs. 
If any good mapred-local-dir becomes bad, TaskTracker reinitializes itself, so 
the effect on the TaskTracker side is similar to receiving a 
ReinitTrackerAction from JobTracker. In the existing code, JobTracker sends a 
ReinitTrackerAction when it finds that this TaskTracker was lost some time ago 
and has now come back.
(4) A new configuration property, mapred.disk.healthChecker.interval (value 
in milliseconds), is added with a default value of 60000. This is the 
interval between two consecutive health checks of mapred-local-dirs by 
TaskTracker.
(5) TaskTracker's in-memory configuration is also updated every time 
initialize() runs. The correct value of mapred.local.dir is set in each 
task's configuration before the task is launched.
(6) TaskTracker passes the list of good mapred-local-dirs to the Linux Task 
Controller binary as a parameter (a comma-separated list). Linux Task 
Controller uses only these good mapred-local-dirs. So with this patch, Linux 
Task Controller's configuration file taskcontroller.cfg no longer has to 
contain mapred.local.dir. Even if taskcontroller.cfg contains 
mapred.local.dir, Linux Task Controller ignores it.
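
The bookkeeping in (1), (2), (3) and (6) can be sketched roughly as follows. 
This is only an illustration and is not taken from MR-2413.v0.patch: the class 
LocalDirsSketch, its methods, the isUsable() test (directory exists or can be 
created, and is readable, writable and executable) and the decision to refuse 
startup when no dir is good are all assumptions made for the example.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; names and checks are assumptions, not the patch's code.
class LocalDirsSketch {
    // Mirrors the default of mapred.disk.healthChecker.interval (milliseconds).
    static final long DEFAULT_CHECK_INTERVAL_MS = 60000L;

    final List<String> goodDirs = new ArrayList<>();
    final List<String> badDirs = new ArrayList<>();

    // Assumed health test: the dir exists (or can be created) and is r/w/x.
    static boolean isUsable(File dir) {
        return (dir.isDirectory() || dir.mkdirs())
                && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    // Startup: classify every configured mapred-local-dir as good or bad.
    LocalDirsSketch(String[] configuredDirs) {
        for (String d : configuredDirs) {
            if (isUsable(new File(d))) {
                goodDirs.add(d);
            } else {
                badDirs.add(d);
            }
        }
        // Assumption: with no usable dir at all, startup is refused.
        if (goodDirs.isEmpty()) {
            throw new IllegalStateException("no usable mapred-local-dir");
        }
    }

    // Runtime: re-check only the good dirs and move failed ones to badDirs.
    // Returns true if any dir went bad, i.e. the caller would reinitialize.
    boolean checkGoodDirs() {
        boolean anyFailed = false;
        Iterator<String> it = goodDirs.iterator();
        while (it.hasNext()) {
            String d = it.next();
            if (!isUsable(new File(d))) {
                it.remove();
                badDirs.add(d);
                anyFailed = true;
            }
        }
        return anyFailed;
    }

    // Comma-separated form of the good dirs, as in item (6).
    String goodDirsAsCsv() {
        return String.join(",", goodDirs);
    }
}
```

In such a sketch, a caller would invoke checkGoodDirs() once per configured 
interval and trigger reinitialization when it returns true, and would hand 
goodDirsAsCsv() to the task controller when launching tasks.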
------------------------------------------------------------------
With this patch,

What happens after a disk fails but before TaskTracker reinitializes itself?

Currently running tasks, and tasks being launched at that time, can fail if 
they try to use the bad disk.

What happens after TT re-initialization?

All the mapred-local-dirs are cleaned up during re-initialization, so running 
tasks can fail because of this cleanup. Finished maps whose outputs have not 
yet been fetched by their jobs' reduces will also fail with a "too many fetch 
failures" error, because these map outputs are cleaned up as well and this 
TaskTracker can no longer serve them to reduces.

> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.patch
>
>
> At present, TaskTracker doesn't handle disk failures properly both at startup 
> and runtime.
> (1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is 
> on a bad disk. TaskTracker should ignore that particular mapred-local-dir and 
> start up and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while TaskTracker is running, currently TaskTracker 
> doesn't do anything special. This results in either
>    (a) TaskTracker continues to "try to use that bad disk", which results 
> in lots of task failures and possibly job failures (because of multiple TTs 
> having bad disks), and eventually in these TTs getting graylisted for all 
> jobs. This needs a manual restart of TT with a modified configuration of 
> mapred-local-dirs that avoids the bad disk. OR
>    (b) Health check script identifying the disk as bad and the TT gets 
> blacklisted. And this also needs manual restart of TT with modified 
> configuration of mapred-local-dirs avoiding the bad disk.
> This JIRA is to make TaskTracker more fault-tolerant to disk failures by 
> solving (1) and (2). That is, TT should start as long as at least one of the 
> mapred-local-dirs is on a good disk, and TT should adjust its in-memory list 
> of mapred-local-dirs and avoid using bad mapred-local-dirs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
