[ https://issues.apache.org/jira/browse/HDFS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653268#comment-13653268 ]
Chris Nauroth commented on HDFS-4811:
-------------------------------------
Our story comes to a tragic conclusion in {{FSImage#renameCheckpointInDir}}:
{code}
  private void renameCheckpointInDir(StorageDirectory sd, long txid)
      throws IOException {
    File ckpt = NNStorage.getStorageFile(sd, NameNodeFile.IMAGE_NEW, txid);
    File curFile = NNStorage.getStorageFile(sd, NameNodeFile.IMAGE, txid);
    // renameTo fails on Windows if the destination file
    // already exists.
    if(LOG.isDebugEnabled()) {
      LOG.debug("renaming " + ckpt.getAbsolutePath()
                + " to " + curFile.getAbsolutePath());
    }
    if (!ckpt.renameTo(curFile)) {
      if (!curFile.delete() || !ckpt.renameTo(curFile)) {
        throw new IOException("renaming " + ckpt.getAbsolutePath() + " to " +
            curFile.getAbsolutePath() + " FAILED");
      }
    }
  }
{code}
Expanding on my example in the description, thread 1 holds {{ckpt}} open for
write while thread 2 attempts to rename {{ckpt}} to {{curFile}}. On Linux and
OS X, the rename succeeds, but the fsimage file now has incomplete content,
because thread 1 hasn't finished writing into it. On Windows, the first rename
fails, so we delete the existing fsimage. The second rename attempt then also
fails because thread 1 holds a lock on the file, and by that point the good
fsimage is already gone.
The problem is more visible on Windows, where we see an intermittent failure in
{{TestStandbyCheckpoints#testBothNodesInStandbyState}}.
One potential solution would be for {{GetImageServlet}} to take the namesystem
lock while downloading the image from the other namenode and renaming.
{{StandbyCheckpointer#doCheckpoint}} already acquires the namesystem lock, so
doing the same in {{GetImageServlet}} would enforce mutual exclusion.
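
A rough sketch of the locking idea above, not the actual HDFS code. The class and method names are hypothetical stand-ins, a plain {{ReentrantLock}} stands in for the namesystem lock, and the write and rename steps are recorded as events instead of touching real files. The point is only that once both paths take the same lock, one path's write and rename can never be split by the other path:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class CheckpointMutexSketch {
    // Hypothetical stand-in for the namesystem lock.
    private final ReentrantLock namesystemLock = new ReentrantLock();
    // Ordered record of what each path did; only touched while holding the lock.
    private final List<String> events = new ArrayList<>();

    // Stand-in for the namenode running its own checkpoint (thread 1).
    void localCheckpoint(long txid) {
        writeThenRename("local", txid);
    }

    // Stand-in for receiving a checkpoint from the other namenode (thread 2).
    void receivedCheckpoint(long txid) {
        writeThenRename("remote", txid);
    }

    private void writeThenRename(String who, long txid) {
        namesystemLock.lock();
        try {
            // A real implementation would write the checkpoint file here...
            events.add(who + ":write:" + txid);
            Thread.yield(); // invite the other thread to run mid-operation
            // ...and rename it to the final fsimage name here.
            events.add(who + ":rename:" + txid);
        } finally {
            namesystemLock.unlock();
        }
    }

    // Runs both paths concurrently and returns the recorded event order.
    static List<String> runBoth(long txid) {
        CheckpointMutexSketch s = new CheckpointMutexSketch();
        Thread t1 = new Thread(() -> s.localCheckpoint(txid));
        Thread t2 = new Thread(() -> s.receivedCheckpoint(txid));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.events;
    }

    public static void main(String[] args) {
        // Which path goes first varies between runs, but each path's
        // write is always immediately followed by its own rename.
        System.out.println(runBoth(7L));
    }
}
```

Holding the lock for the whole download would block other namesystem operations for that long, so this shows the mutual exclusion only and says nothing about the cost.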
Note that we are already protected against multiple put image calls by
{{GetImageServlet}} tracking currently downloading checkpoints and rejecting
duplicated requests. This is just a race condition with the namenode running
its own checkpoint.
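
For illustration, the duplicate-request protection mentioned above can be pictured roughly like this. It is a minimal sketch under my own assumptions, not the {{GetImageServlet}} implementation: the class name, method names, and the use of a concurrent set keyed by txid are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class InFlightCheckpointTracker {
    // Transaction ids of checkpoints that are currently being downloaded.
    private final Set<Long> downloading = ConcurrentHashMap.newKeySet();

    // Returns true if the caller may start downloading this txid,
    // false if another request for the same txid is already in progress.
    boolean tryBegin(long txid) {
        return downloading.add(txid);
    }

    // Must be called when the download finishes, successfully or not.
    void finish(long txid) {
        downloading.remove(txid);
    }

    public static void main(String[] args) {
        InFlightCheckpointTracker tracker = new InFlightCheckpointTracker();
        System.out.println(tracker.tryBegin(42L)); // true: first request accepted
        System.out.println(tracker.tryBegin(42L)); // false: duplicate rejected
        tracker.finish(42L);
        System.out.println(tracker.tryBegin(42L)); // true: accepted again after finish
    }
}
```

A guard of this shape only sees incoming put image requests. The namenode's own checkpoint never passes through it, which is why the race described here is not covered.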
> race condition between 2 namenodes in standby that are trying to checkpoint
> with one another can delete or corrupt a good fsimage
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-4811
> URL: https://issues.apache.org/jira/browse/HDFS-4811
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 3.0.0, 2.0.5-beta
> Reporter: Chris Nauroth
>
> The problem occurs under concurrent execution of the namenode running its own
> checkpoint in {{StandbyCheckpointer}} in thread 1 while also getting a
> checkpoint from a different namenode in {{GetImageServlet}} in thread 2. It
> is possible for thread 2 to finish writing the checkpoint to the directory,
> but then get suspended before it has a chance to rename it to its final
> destination as an fsimage file. Then, thread 1 wakes up and starts writing
> its own data to the checkpoint file. When thread 2 resumes, it then tries to
> rename the file that thread 1 still holds open for writing. Depending on the OS,
> this either moves thread 1's incomplete checkpoint to fsimage, or it just
> outright deletes the existing good fsimage until thread 1 finishes writing
> and renames.