[jira] Commented: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

Bhallamudi Venkata Siva Kamesh (JIRA) Thu, 17 Feb 2011 07:19:50 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995845#comment-12995845
 ]


Bhallamudi Venkata Siva Kamesh commented on MAPREDUCE-1213:
-----------------------------------------------------------

While analyzing the patch, I found an issue. The moveAndDelete method below is 
called from both JobTracker and TaskTracker: JobTracker calls it on its 
JobTracker folder, and TaskTracker on its TaskTracker folder (e.g. 
/home/hadoop/tasktracker/local). The method renames the current folder and 
deletes it asynchronously. If the deletion step fails for some reason (for 
example, the process is killed abruptly), the renamed folders are never 
deleted by anyone. 




{code:title=MRAsyncDiskService.java|borderStyle=solid}

  public boolean moveAndDelete(String volume, String pathName)
      throws IOException {
    // Move the file right now, so that it can be deleted later
    String newPathName;
    synchronized (this) {
      newPathName = format.format(new Date()) + "_" + uniqueId;
      uniqueId ++;
    }
    newPathName = SUBDIR + Path.SEPARATOR_CHAR + newPathName;

    Path source = new Path(volume, pathName);
    Path target = new Path(volume, newPathName);
    try {
      if (!localFileSystem.rename(source, target)) {
        return false;
      }
    } catch (FileNotFoundException e) {
      // Return false in case that the file is not found.
      return false;
    }
    DeleteTask task = new DeleteTask(volume, pathName, newPathName);
    execute(volume, task);
    return true;
  }
{code}
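
One possible way to handle the leftover folders, sketched here as a standalone example that uses only java.nio.file rather than the Hadoop classes: at startup, sweep whatever is still sitting in the holding directory. The class name LeftoverCleaner, its method names, the value "toBeDeleted" for the holding directory, and the sample folder name are all assumptions made for illustration; none of them is taken from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Hadoop or of the patch.
final class LeftoverCleaner {
  // Assumed name of the holding directory (plays the role of SUBDIR above).
  static final String HOLDING_DIR = "toBeDeleted";

  // Deletes a file or directory tree, children before their parents.
  static void deleteRecursively(Path root) throws IOException {
    try (Stream<Path> walk = Files.walk(root)) {
      // Reverse path order visits entries before the directories that hold them.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  }

  // Removes everything left in <volume>/HOLDING_DIR by an interrupted deletion.
  // Returns the number of top-level leftovers that were removed.
  static int cleanupLeftovers(Path volume) throws IOException {
    Path holding = volume.resolve(HOLDING_DIR);
    if (!Files.isDirectory(holding)) {
      return 0;
    }
    // Snapshot the entries first so deletion does not race with the listing.
    List<Path> leftovers = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(holding)) {
      for (Path entry : entries) {
        leftovers.add(entry);
      }
    }
    for (Path leftover : leftovers) {
      deleteRecursively(leftover);
    }
    return leftovers.size();
  }

  public static void main(String[] args) throws IOException {
    // Simulate a renamed folder whose asynchronous delete never ran.
    Path volume = Files.createTempDirectory("volume");
    Files.createDirectories(
        volume.resolve(HOLDING_DIR).resolve("2011-02-17_0").resolve("cache"));
    System.out.println("removed " + cleanupLeftovers(volume) + " leftover(s)");
  }
}
```

Such a sweep would have to run before any new moveAndDelete call adds entries to the holding directory. Whether the sweep itself should be handed to the asynchronous deleter, so that it does not slow down the restart again, is left open in this sketch.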

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1213
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: dhruba borthakur
>            Assignee: Zheng Shao
>             Fix For: 0.21.0
>
>         Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch, 
> MAPREDUCE-1213.branch-0.20.2.patch, MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the distributed cache holds a few 
> thousand files.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
