[ https://issues.apache.org/jira/browse/HADOOP-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated HADOOP-1046:
--------------------------------------

    Attachment: fsdataset.patch

Patch relative to trunk/

> Datanode should periodically clean up /tmp from partially received (and not
> completed) block files
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-1046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1046
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.9.2, 0.12.0
>         Environment: Cluster of 10 machines, running Hadoop 0.9.2 + Nutch
>            Reporter: Andrzej Bialecki
>         Attachments: fsdataset.patch
>
>
> Cluster is set up with tasktrackers running on the same machines as
> datanodes. Tasks create heavy load in terms of local CPU/RAM/disk I/O. I
> noticed a lot of the following messages from the datanodes in such
> situations:
>
> 2007-02-15 05:30:53,298 WARN dfs.DataNode - Failed to transfer blk_-4590782726923911824 to xxx.xxx.xxx/10.10.16.109:50010
> java.net.SocketException: Connection reset
> ....
> java.io.IOException: Block blk_71053993347675204 has already been started (though not completed), and thus cannot be created.
>
> My reading of the code in DataNode.DataXceiver.writeBlock() and
> FSDataset.writeToBlock() + FSDataset.java:459 suggests the following
> scenario: there is no cleanup of temporary files in /tmp that are used to
> store the incomplete blocks being transferred. If the datanode is
> CPU-starved and drops the connection while creating this temp file, the
> source datanode will attempt to transfer it again - but there is already a
> file under this name in /tmp, because when the connection was dropped the
> target datanode did not clean up.
> I also see that this section is unchanged in trunk/.
> The solution to this would be to check the age of the physical file in the
> /tmp dir, in FSDataset.java:436 - if it's older than a few hours or so, we
> should delete it and proceed as if there were no ongoing create op for this
> block.
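
The age check proposed above could look roughly like the sketch below. It is
not the attached fsdataset.patch and does not use any Hadoop classes. The
class name StaleTmpBlockCheck, the method reclaimIfStale, and the three-hour
threshold are all hypothetical, chosen only to illustrate "older than a few
hours or so". The current time is passed in as a parameter so the logic can be
exercised without waiting.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: decides whether a leftover
// temporary block file may be discarded so that a new write can start.
final class StaleTmpBlockCheck {

    // Assumed threshold; the report only says "a few hours or so".
    static final long STALE_AFTER_MS = 3L * 60 * 60 * 1000;

    /**
     * Returns true if a new write for the block may proceed: either no
     * temporary file exists, or an old one was found and deleted.
     * Returns false if the file is recent enough that another transfer
     * may still be writing to it.
     */
    static boolean reclaimIfStale(File tmpFile, long nowMs, long staleAfterMs)
            throws IOException {
        if (!tmpFile.exists()) {
            return true; // nothing in the way
        }
        long ageMs = nowMs - tmpFile.lastModified();
        if (ageMs < staleAfterMs) {
            return false; // possibly still in progress, leave it alone
        }
        // Old enough to be treated as an abandoned transfer.
        if (!tmpFile.delete()) {
            throw new IOException(
                "Could not delete stale temporary block file " + tmpFile);
        }
        return true;
    }
}
```

A caller on the write path might invoke
reclaimIfStale(tmpFile, System.currentTimeMillis(), STALE_AFTER_MS) before
raising the "has already been started" error, and raise it only when the
method returns false. The file's modification time is used as a proxy for the
last write activity, which assumes that an active transfer keeps touching the
file.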