[
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995845#comment-12995845
]
Bhallamudi Venkata Siva Kamesh commented on MAPREDUCE-1213:
-----------------------------------------------------------
While analyzing the patch, I found an issue. The moveAndDelete method below is
called from both the JobTracker and the TaskTracker. The JobTracker calls it
on its JobTracker folder, and the TaskTracker on its TaskTracker folder (e.g.
/home/hadoop/tasktracker/local). The method renames the current folder and
deletes it asynchronously. If the deletion step fails for some reason (for
example, the process is killed abruptly), the renamed folders are never
deleted by anyone.
{code:title=MRAsyncDiskService.java|borderStyle=solid}
public boolean moveAndDelete(String volume, String pathName) throws IOException
{
  // Move the file right now, so that it can be deleted later
  String newPathName;
  synchronized (this) {
    newPathName = format.format(new Date()) + "_" + uniqueId;
    uniqueId++;
  }
  newPathName = SUBDIR + Path.SEPARATOR_CHAR + newPathName;
  Path source = new Path(volume, pathName);
  Path target = new Path(volume, newPathName);
  try {
    if (!localFileSystem.rename(source, target)) {
      return false;
    }
  } catch (FileNotFoundException e) {
    // Return false in case that the file is not found.
    return false;
  }
  DeleteTask task = new DeleteTask(volume, pathName, newPathName);
  execute(volume, task);
  return true;
}
{code}
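One possible way to close this gap, sketched here only as an illustration and not taken from the patch, is to sweep the holding directory on startup and remove whatever an earlier run left behind. The sketch uses plain java.nio rather than Hadoop's file system classes so that it is self-contained. The class name LeftoverCleanup, the method cleanupLeftovers, and the value "toBeDeleted" for SUBDIR are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of MRAsyncDiskService.
class LeftoverCleanup {
    // Assumed name of the holding directory; stands in for SUBDIR in the snippet above.
    static final String SUBDIR = "toBeDeleted";

    /**
     * Deletes every entry found under volume/SUBDIR and returns the number
     * of top-level entries removed. Intended to run once at startup, before
     * any new moveAndDelete calls rename folders into the same directory.
     */
    static int cleanupLeftovers(Path volume) throws IOException {
        Path holding = volume.resolve(SUBDIR);
        if (!Files.isDirectory(holding)) {
            return 0; // nothing was ever moved here
        }
        // Collect first, so the directory stream is closed before deleting.
        List<Path> leftovers = new ArrayList<>();
        try (Stream<Path> entries = Files.list(holding)) {
            entries.forEach(leftovers::add);
        }
        for (Path leftover : leftovers) {
            deleteRecursively(leftover);
        }
        return leftovers.size();
    }

    /** Removes a file or a whole directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

In a real service the sweep would more likely be handed to the same asynchronous executor, so that startup is not blocked by a large backlog of renamed folders.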
> TaskTrackers restart is very slow because it deletes distributed cache
> directory synchronously
> ----------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.20.1
> Reporter: dhruba borthakur
> Assignee: Zheng Shao
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch,
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch,
> MAPREDUCE-1213.branch-0.20.2.patch, MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a tasktracker, it tries to recursively
> delete all the files in the distributed cache. It invokes
> FileUtil.fullyDelete(), which is very slow. This means that the
> TaskTracker cannot join the cluster for an extended period of time (up to 2
> hours for us). The problem is acute if the distributed cache holds a few
> thousand files.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira