[
https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529699#comment-14529699
]
Aaron T. Myers commented on HDFS-6440:
--------------------------------------
Hi Jesse and Lars,
My sincere apologies it took so long for me to post a review. No good excuse
except being busy, but what else is new.
Anyway, the patch looks pretty good to me. Most everything that's below is
pretty small stuff.
One small potential correctness issue:
# In {{StandbyCheckpointer#doCheckpoint}}, unless I'm missing something, I
don't think the variable "{{ie}}" can ever be non-null, and yet we check for
whether or not it's null later in the method to determine if we should shut
down.
Two things I'd really like to see some test coverage for:
# The changes to {{BlockTokenSecretManager}} - they look fine to me in general,
but I'd love to see some extra tests of this functionality with several NNs in
play. Unless I missed something, I don't think there are any tests that would
exercise more than 2 {{BlockTokenSecretManager}}s.
# Rolling upgrades/downgrades/rollbacks. I agree with you in general that this
change should likely not affect anything, but I think it's important that we
have some test(s) exercising this regardless.
Several little nits:
# In {{MiniZKFCCluster}}, this method now supports more than just two services:
"+ * Set up two services and their failover controllers."
# Recommend making {{intRange}} and {{nnRangeStart}} final in
{{BlockTokenSecretManager}}.
# Should document the behavior of both of the newly-introduced config keys
(dfs.namenode.checkpoint.check.quiet-multiplier and
dfs.hs.tail-edits.namenode-retries) in hdfs-default.xml.
# I think this error message could be a bit clearer:
{quote}
+ "Node is currently not in the active state, state:" +
state +
+ " does not support reading FSImages from other NameNodes");
{quote}
Recommend something like "NameNode <hostname or IP address> is currently not in
a state which can accept uploads of new fsimages. State: <state>".
# Would be great for debugging purposes if we could include the hostname or IP
address of the checkpointer already doing the upload with the higher txid in
this message:
{quote}
+ "Another checkpointer is already in the process of
uploading a" +
+ " checkpoint made up to transaction ID " +
larger.last());
{quote}
# Spelled "failure" incorrectly here: "AUTHENTICATION_FAILRE"
# Sorry, I don't quite follow this comment in {{BootstrapStandby}}:
{quote}
+ // get the namespace from any active NN. On a fresh cluster, this is
the active. On a
+ // running cluster, this works on any node.
{quote}
What's a "fresh cluster" vs. a "running cluster" in this sense?
# In {{HATestUtil#waitForStandbyToCatchUp}}, looks like you changed the method
comment to indicate that the method takes multiple standbys as an argument, but
in fact the method functionality is unchanged. There's just some whitespace
changes in that method.
# In {{TestPipelinesFailover#doWriteOverFailoverTest}}, is changing the value
of {{FAILOVER_SEED}} going to do anything, given that it's only ever read at
the static initialization of the {{failoverRandom}}?
Also, not a problem at all, but just want to say that I really like the way
this patch changes TransferFsImage, and the additional diagnostic info it
provides when uploads fail. That's a nice little improvement by itself.
I'll be +1 once this stuff is addressed.
> Support more than 2 NameNodes
> -----------------------------
>
> Key: HDFS-6440
> URL: https://issues.apache.org/jira/browse/HDFS-6440
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: auto-failover, ha, namenode
> Affects Versions: 2.4.0
> Reporter: Jesse Yates
> Assignee: Jesse Yates
> Attachments: Multiple-Standby-NameNodes_V1.pdf,
> hdfs-6440-cdh-4.5-full.patch, hdfs-6440-trunk-v1.patch,
> hdfs-6440-trunk-v1.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one
> active, one standby). This would be the last bit to support running multiple
> _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some
> complexity around managing the checkpointing, and updating a whole lot of
> tests.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)