[ https://issues.apache.org/jira/browse/HDFS-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Foley updated HDFS-1921:
-----------------------------

    Attachment: hdfs-1921-2.patch

@Todd: nice tweak to the unit test. I changed the name of the subroutine to "doTestFailedSaveNamespace", since it isn't a test case in its own right.

@Suresh:
bq. Code of thread starting logic is duplicated. It could be added to a separate method.

Sounded right, so I implemented the suggestion, and then concluded it made the code _more_ complex instead of better, because of the way it worked out with the try/catch context and the management of the errorSDs list.

bq. Also continue in catch block is redundant.

The "continue"s are there for defensive coding: If someone adds statements after the catch context, but within the loop, I believe the catch context should go to the next loop iteration immediately.

bq. Minor: per the coding guidelines please add { } after if statements.

Done, thanks.

One more time :-)

> Save namespace can cause NN to be unable to come up on restart
> --------------------------------------------------------------
>
>                 Key: HDFS-1921
>                 URL: https://issues.apache.org/jira/browse/HDFS-1921
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Aaron T. Myers
>            Assignee: Matt Foley
>            Priority: Blocker
>             Fix For: 0.22.0, 0.23.0
>
>         Attachments: hdfs-1505-1-test.txt, hdfs-1921-2.patch, hdfs-1921.txt,
>                      hdfs1921_v23.patch, hdfs1921_v23.patch
>
>
> I discovered this in the course of trying to implement a fix for HDFS-1505.
> Per the comment for {{FSImage.saveNamespace(...)}}, the algorithm for save
> namespace proceeds in the following order:
> # rename current to lastcheckpoint.tmp for all of them,
> # save image and recreate edits for all of them,
> # rename lastcheckpoint.tmp to previous.checkpoint.
> The problem is that step 3 occurs regardless of whether or not an error
> occurs for all storage directories in step 2.
> Upon restart, the NN will see non-existent or corrupt {{current}}
> directories, and no {{lastcheckpoint.tmp}} directories, and so will conclude
> that the storage directories are not formatted.
>
> This issue appears to be present on both 0.22 and 0.23. This should arguably
> be a 0.22/0.23 blocker.
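
For illustration only, here is a minimal sketch of the three-step ordering from the issue description, with step 3 guarded so it cannot run when step 2 failed everywhere. It is not the code in the attached patch and does not use the real {{FSImage}} API: the class {{SaveNamespaceSketch}}, the {{ImageWriter}} interface, and the rollback behaviour are all assumptions made for the example. Only the directory names and the {{errorSDs}} list name come from the messages above. The catch block keeps the defensive "continue" discussed in the comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the save-namespace ordering; not the HDFS FSImage code. */
final class SaveNamespaceSketch {

    /** Hypothetical stand-in for "save image and recreate edits" in one directory. */
    interface ImageWriter {
        void write(Path currentDir) throws IOException;
    }

    /**
     * Runs the three steps over all storage directories and returns the
     * directories whose step 2 failed. If every directory failed, step 3 is
     * skipped and each directory is rolled back to its previous state
     * (an assumed recovery policy, chosen for the example).
     */
    static List<Path> saveNamespace(List<Path> storageDirs, ImageWriter writer)
            throws IOException {
        // Step 1: rename current to lastcheckpoint.tmp for all of them.
        for (Path sd : storageDirs) {
            Files.move(sd.resolve("current"), sd.resolve("lastcheckpoint.tmp"));
        }

        // Step 2: save image and recreate edits for all of them.
        List<Path> errorSDs = new ArrayList<>();
        for (Path sd : storageDirs) {
            try {
                Path current = Files.createDirectory(sd.resolve("current"));
                writer.write(current);
            } catch (IOException e) {
                errorSDs.add(sd);
                continue; // defensive: skip anything later added to the loop body
            }
        }

        // Guard: if step 2 failed everywhere, restore instead of running step 3.
        if (errorSDs.size() == storageDirs.size()) {
            for (Path sd : storageDirs) {
                deleteRecursively(sd.resolve("current"));
                Files.move(sd.resolve("lastcheckpoint.tmp"), sd.resolve("current"));
            }
            return errorSDs;
        }

        // Step 3: rename lastcheckpoint.tmp to previous.checkpoint, but only
        // in directories where step 2 succeeded.
        for (Path sd : storageDirs) {
            if (errorSDs.contains(sd)) {
                continue;
            }
            deleteRecursively(sd.resolve("previous.checkpoint"));
            Files.move(sd.resolve("lastcheckpoint.tmp"), sd.resolve("previous.checkpoint"));
        }
        return errorSDs;
    }

    /** Deletes a file or directory tree if it exists; children before parents. */
    static void deleteRecursively(Path p) throws IOException {
        if (!Files.exists(p)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(p)) {
            List<Path> paths = walk.sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
            for (Path q : paths) {
                Files.delete(q);
            }
        }
    }
}
```

In this sketch a directory that fails while others succeed is left with its {{lastcheckpoint.tmp}} intact, so a later recovery pass could still find the old image. How the real NameNode treats such directories is outside what this example claims.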