[jira] [Commented] (HDFS-4811) race condition between 2 namenodes in standby that are trying to checkpoint with one another can delete or corrupt a good fsimage

Todd Lipcon (JIRA) Thu, 09 May 2013 14:43:17 -0700

    [ https://issues.apache.org/jira/browse/HDFS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653262#comment-13653262 ]


Todd Lipcon commented on HDFS-4811:
-----------------------------------

[~andrew.wang] - would the fix you were working on, which uses timestamped 
tmp files for checkpoints, also fix this?
                
> race condition between 2 namenodes in standby that are trying to checkpoint 
> with one another can delete or corrupt a good fsimage
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4811
>                 URL: https://issues.apache.org/jira/browse/HDFS-4811
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0, 2.0.5-beta
>            Reporter: Chris Nauroth
>
> The problem occurs under concurrent execution of the namenode running its own 
> checkpoint in {{StandbyCheckpointer}} in thread 1 while also getting a 
> checkpoint from a different namenode in {{GetImageServlet}} in thread 2.  It 
> is possible for thread 2 to finish writing the checkpoint to the directory, 
> but then get suspended before it has a chance to rename it to its final 
> destination as an fsimage file.  Then, thread 1 wakes up and starts writing 
> its own data to the checkpoint file.  When thread 2 resumes, it then tries to 
> rename the file that thread 1 still holds open for writing.  Depending on the OS, 
> this either moves thread 1's incomplete checkpoint to fsimage, or it just 
> outright deletes the existing good fsimage until thread 1 finishes writing 
> and renames.
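
For illustration only, here is a minimal Java sketch of the idea behind the 
question above: give each writer its own temporary file, so that one thread 
can never rename a file another thread is still writing.  This is not the 
HDFS code and not the proposed patch.  The class name {{CheckpointSketch}}, 
the method {{saveImage}}, and the file name patterns are hypothetical, and the 
sketch uses {{Files.createTempFile}} for uniqueness rather than a timestamp.  
It also assumes the file system supports atomic rename within one directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch; not the actual HDFS checkpoint code. */
final class CheckpointSketch {

    private CheckpointSketch() {}

    /**
     * Writes the image bytes to a temporary file that is unique to this call,
     * then renames it to its final name. Because two concurrent callers never
     * share a temporary file, neither can rename the other's partial output.
     */
    static Path saveImage(Path dir, long txid, byte[] image) throws IOException {
        // createTempFile picks a name that does not already exist in dir.
        Path tmp = Files.createTempFile(dir, "fsimage.ckpt_" + txid + "_", ".tmp");
        try {
            Files.write(tmp, image);
            Path dst = dir.resolve("fsimage_" + txid);
            // Assumption: atomic rename is supported here; if it is not,
            // this throws AtomicMoveNotSupportedException.
            return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this shape, the scenario in the description changes: thread 2 renames 
only the file it wrote itself, and thread 1's in-progress file keeps a 
different name until thread 1 finishes and renames it.  What happens when the 
destination already exists is platform dependent for {{ATOMIC_MOVE}}, so a 
real fix would still need to decide how to handle two complete images for the 
same transaction ID.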

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
