[
https://issues.apache.org/jira/browse/HDFS-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407560#comment-13407560
]
Todd Lipcon commented on HDFS-3597:
-----------------------------------
bq. That's an issue I was confused about too. I don't understand why the test
has multiple checkpoint dirs, nor why my 2NN is running in
snn.getCheckpointDirs().get(1) rather than .get(0). (If I corrupt the first
checkpointdir, there is no perceptible effect on the testcase.) The println is
a leftover from when I was still attempting to exercise the upgrade code.
The 2NN can be configured with multiple directories. Our tests make use of that
feature:
{code}
conf.set(DFS_NAMENODE_CHECKPOINT_DIR_KEY,
    fileAsURI(new File(base_dir, "namesecondary" + (2*nnIndex + 1))) + "," +
    fileAsURI(new File(base_dir, "namesecondary" + (2*nnIndex + 2))));
{code}
(from MiniDFSCluster source)
I bet we have some bug/feature whereby, if only one of the two is corrupted, the
behavior depends on which one it was. My guess is that we iterate over each of
the dirs during startup and load the properties from each, so the last one takes
precedence by the time we get to the version-checking code. Might be worth
fixing this in a separate JIRA (out of scope for this one).
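The last-one-wins guess above can be illustrated with a minimal sketch (hypothetical names, not the actual Storage/MiniDFSCluster code): if each directory's VERSION properties are loaded into the same map in order, a corruption in the first directory is silently masked by the second.
{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the suspected behavior: startup iterates over the
// configured checkpoint dirs and loads each one's VERSION properties into
// the same map, so the last directory read takes precedence.
public class LastDirWins {
  static Map<String, String> loadAll(List<Map<String, String>> dirVersions) {
    Map<String, String> props = new HashMap<>();
    for (Map<String, String> version : dirVersions) {
      props.putAll(version); // later dirs overwrite earlier ones
    }
    return props;
  }

  public static void main(String[] args) {
    // namesecondary1 holds a corrupted layoutVersion; namesecondary2 is intact.
    Map<String, String> dir1 = Map.of("layoutVersion", "-19"); // corrupted
    Map<String, String> dir2 = Map.of("layoutVersion", "-40"); // intact
    // By the time version checking runs, only dir2's value is visible, which
    // would explain why corrupting the first dir has no effect on the test.
    System.out.println(loadAll(List.of(dir1, dir2)).get("layoutVersion")); // -40
  }
}
{code}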
Given the above, though, I think it makes sense to edit the VERSION file in both
of those directories, since the test case currently depends on that other bug.
Will look at your new patch later this afternoon.
> SNN can fail to start on upgrade
> --------------------------------
>
> Key: HDFS-3597
> URL: https://issues.apache.org/jira/browse/HDFS-3597
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.0-alpha
> Reporter: Andy Isaacson
> Assignee: Andy Isaacson
> Priority: Minor
> Attachments: hdfs-3597-2.txt, hdfs-3597.txt
>
>
> When upgrading from 1.x to 2.0.0, the SecondaryNameNode can fail to start up:
> {code}
> 2012-06-16 09:52:33,812 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
> doCheckpoint
> java.io.IOException: Inconsistent checkpoint fields.
> LV = -40 namespaceID = 64415959 cTime = 1339813974990 ; clusterId =
> CID-07a82b97-8d04-4fdd-b3a1-f40650163245 ; blockpoolId =
> BP-1792677198-172.29.121.67-1339813967723.
> Expecting respectively: -19; 64415959; 0; ; .
> at
> org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:120)
> at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:454)
> at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
> at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
> at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> The error check we're hitting came from HDFS-1073, and it's intended to
> verify that we're connecting to the correct NN. But the check is too strict
> and considers "different metadata version" to be the same as "different
> clusterID".
> I believe the check in {{doCheckpoint}} simply needs to explicitly check for
> and handle the update case.
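The fix suggested in the description could be sketched as follows (a simplified, hypothetical stand-in with made-up types and method names; the real check lives in CheckpointSignature.validateStorageInfo and doCheckpoint): a namespaceID or clusterID mismatch still means "wrong NN" and stays fatal, while a layout-version mismatch, and the empty clusterID of a pre-upgrade 1.x checkpoint dir, route to the upgrade path instead of throwing.
{code}
public class CheckpointCheck {
  // Simplified stand-in for the fields compared in validateStorageInfo.
  static final class StorageInfo {
    final int layoutVersion;
    final int namespaceID;
    final String clusterID;
    StorageInfo(int layoutVersion, int namespaceID, String clusterID) {
      this.layoutVersion = layoutVersion;
      this.namespaceID = namespaceID;
      this.clusterID = clusterID;
    }
  }

  /** Returns true if the SNN should take the upgrade path rather than fail. */
  static boolean needsUpgrade(StorageInfo local, StorageInfo nn) {
    if (local.namespaceID != nn.namespaceID) {
      throw new IllegalStateException("checkpoint dir belongs to another namespace");
    }
    // A pre-upgrade (1.x) checkpoint dir has no clusterID yet: adopt the
    // NN's instead of treating the mismatch as "wrong cluster".
    if (!local.clusterID.isEmpty() && !local.clusterID.equals(nn.clusterID)) {
      throw new IllegalStateException("checkpoint dir belongs to another cluster");
    }
    return local.layoutVersion != nn.layoutVersion; // upgrade case, not an error
  }

  public static void main(String[] args) {
    // The scenario from the log: local LV -19 (1.x), NN LV -40 (2.0.0-alpha),
    // same namespaceID, local clusterID still empty.
    StorageInfo local = new StorageInfo(-19, 64415959, "");
    StorageInfo nn =
        new StorageInfo(-40, 64415959, "CID-07a82b97-8d04-4fdd-b3a1-f40650163245");
    System.out.println(needsUpgrade(local, nn)); // true: upgrade, don't abort
  }
}
{code}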