[ https://issues.apache.org/jira/browse/HDFS-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407560#comment-13407560 ]

Todd Lipcon commented on HDFS-3597:
-----------------------------------

bq. That's an issue I was confused about too. I don't understand why the test 
has multiple checkpoint dirs, nor why my 2NN is running in 
snn.getCheckpointDirs().get(1) rather than .get(0). (If I corrupt the first 
checkpointdir, there is no perceptible effect on the testcase.) The println is 
a leftover from when I was still attempting to exercise the upgrade code.

The 2NN can be configured with multiple directories. Our tests make use of that 
feature:

{code}
        conf.set(DFS_NAMENODE_CHECKPOINT_DIR_KEY,
            fileAsURI(new File(base_dir, "namesecondary" + (2*nnIndex + 1)))+","+
            fileAsURI(new File(base_dir, "namesecondary" + (2*nnIndex + 2))));
{code}
(from MiniDFSCluster source)

I bet we have some bug/feature whereby, if only one of the two is corrupted, the 
behavior depends on which of the two it was. My guess is that we iterate over each 
of the dirs during startup and load the properties from each, so it's the last 
one that takes precedence by the time we get to the version-checking code. 
Might be worth fixing this in a separate JIRA (out of scope for this one).
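That guess can be sketched in plain Java. This is a hypothetical illustration, not the actual Hadoop Storage-loading code: if startup reads the VERSION properties of each checkpoint dir in order into one map, the last directory's values silently win, which would explain why corrupting only the first dir has no visible effect on the test.

```java
import java.util.*;

// Hypothetical sketch (not the actual Storage code): properties loaded
// from each checkpoint dir in order, into a single merged map.
public class LastDirWins {
    static Map<String, String> loadAll(List<Map<String, String>> dirsInOrder) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> versionProps : dirsInOrder) {
            merged.putAll(versionProps); // later dirs overwrite earlier ones
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> dir1 = Map.of("layoutVersion", "-19"); // "corrupted" first dir
        Map<String, String> dir2 = Map.of("layoutVersion", "-40"); // intact second dir
        // The corrupted value from dir1 never reaches the version check:
        System.out.println(loadAll(List.of(dir1, dir2)).get("layoutVersion")); // prints -40
    }
}
```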

Given the above, though, I think it makes sense to edit the VERSION file in both 
of those directories, since the test case as written is basically depending on 
that other bug.

Will look at your new patch later this afternoon.
                
> SNN can fail to start on upgrade
> --------------------------------
>
>                 Key: HDFS-3597
>                 URL: https://issues.apache.org/jira/browse/HDFS-3597
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs-3597-2.txt, hdfs-3597.txt
>
>
> When upgrading from 1.x to 2.0.0, the SecondaryNameNode can fail to start up:
> {code}
> 2012-06-16 09:52:33,812 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
> java.io.IOException: Inconsistent checkpoint fields.
> LV = -40 namespaceID = 64415959 cTime = 1339813974990 ; clusterId = CID-07a82b97-8d04-4fdd-b3a1-f40650163245 ; blockpoolId = BP-1792677198-172.29.121.67-1339813967723.
> Expecting respectively: -19; 64415959; 0; ; .
>         at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:120)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:454)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> The error check we're hitting came from HDFS-1073, and it's intended to 
> verify that we're connecting to the correct NN. But the check is too strict 
> and treats "different metadata version" the same as "different clusterID".
> I believe the check in {{doCheckpoint}} simply needs to explicitly check for 
> and handle the update case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
