[ 
https://issues.apache.org/jira/browse/HADOOP-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-1252:
--------------------------------

    Attachment: 1252.patch

Changed the patch so that the disk-handling logic lives in Configuration.java. 
That way existing uses of the getLocalPath API stay mostly as they are. 
Configuration.java now has a new class, DiskAllocator, that exposes the public 
methods getLocalPathToRead(path), getLocalPathForWrite(path, size), and 
getLocalPathForWrite(path). I haven't touched the code that does caching and 
downloads jar files from DFS. This patch is meant to handle disk failures 
better for map outputs, shuffle, merge, and final reduce inputs. Making the 
caching code tolerate disk failures can probably be a separate JIRA issue.
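
To illustrate the idea behind the three methods named above, here is a minimal 
sketch. It is not the code in 1252.patch: the class name DiskAllocatorSketch, 
the constructor, the round-robin choice of directory, and the free-space check 
via File.getUsableSpace() are all assumptions made for illustration only.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; names and policy are assumptions, not the patch.
class DiskAllocatorSketch {
    private final List<File> localDirs = new ArrayList<>();
    private int next = 0; // round-robin cursor over the configured directories

    DiskAllocatorSketch(List<String> dirs) {
        for (String d : dirs) {
            localDirs.add(new File(d));
        }
    }

    // A directory is usable if it exists (or can be created) and is writable.
    private static boolean usable(File dir) {
        return (dir.isDirectory() || dir.mkdirs()) && dir.canWrite();
    }

    // Returns the first existing copy of relPath found in the local directories.
    synchronized File getLocalPathToRead(String relPath) throws IOException {
        for (File dir : localDirs) {
            File f = new File(dir, relPath);
            if (f.exists()) {
                return f;
            }
        }
        throw new IOException("Could not find " + relPath
            + " in any local directory");
    }

    // Picks the next usable directory with at least `size` bytes free.
    // A negative size means "size unknown": the free-space check is skipped.
    synchronized File getLocalPathForWrite(String relPath, long size)
            throws IOException {
        for (int i = 0; i < localDirs.size(); i++) {
            File dir = localDirs.get(next);
            next = (next + 1) % localDirs.size();
            if (usable(dir) && (size < 0 || dir.getUsableSpace() >= size)) {
                File f = new File(dir, relPath);
                File parent = f.getParentFile();
                if (parent.isDirectory() || parent.mkdirs()) {
                    return f;
                }
            }
        }
        throw new IOException("No usable local directory for " + relPath);
    }

    File getLocalPathForWrite(String relPath) throws IOException {
        return getLocalPathForWrite(relPath, -1);
    }
}
```

In this sketch a failed disk is simply skipped, so a caller gets an 
IOException only when every configured directory is unusable.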

> Disk problems should be handled better by the MR framework
> ----------------------------------------------------------
>
>                 Key: HADOOP-1252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1252
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>             Fix For: 0.13.0
>
>         Attachments: 1252.patch, 1252.patch
>
>
> The MR framework should recover from disk failures without causing jobs to 
> hang. Note that this issue is about a short-term solution: for example, 
> reviewing the code and improving the exception handling (to better detect 
> faulty disks and missing files). The long-term approach might be to have an 
> FS layer that takes care of failed disks and makes them transparent to the 
> tasks. That will be a separate issue by itself.
> Some of the problems that have been reported are HADOOP-1087 and a comment 
> by Koji on HADOOP-1200 (not sure whether those are all). Please add as much 
> detail as possible to this issue on disk failures leading to hung jobs 
> (relevant exception traces, ways to reproduce, etc.).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
