Tianyin Xu created HDFS-7727:
--------------------------------
Summary: Check and verify the auto-fence settings to prevent
failures of auto-failover
Key: HDFS-7727
URL: https://issues.apache.org/jira/browse/HDFS-7727
Project: Hadoop HDFS
Issue Type: Bug
Components: auto-failover
Affects Versions: 2.5.1, 2.6.0, 2.4.1
Reporter: Tianyin Xu
Sorry for reporting similar problems, but the problems reside in different
components, and this one has more severe consequences (this is my last
report of this type of problem).
============================
Problem
-------------------------------------------------
The problem is similar to the following issues resolved in YARN,
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166
and reported (by me) in HDFS EditLogTailer,
https://issues.apache.org/jira/browse/HDFS-7726
Basically, the configuration settings are not checked and verified at
initialization but are directly parsed and applied at runtime. Any configuration
error would impair the corresponding component (since the exceptions are not
caught).
In this case, the values are used in auto-failover, so you won't notice the
errors until one of the NameNodes fails and triggers the fence procedure in the
auto-failover process.
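To illustrate why the error stays hidden, here is a minimal sketch. LazyFencer
is a hypothetical stand-in, not the real SshFenceByTcpPort; it only mimics the
pattern of storing a raw value at construction and parsing it at fence time.

```java
/** Hypothetical stand-in for a fencer; not the real SshFenceByTcpPort. */
final class LazyFencer {
    private final String rawTimeout;

    // The constructor only stores the raw string, so a bad value is accepted.
    LazyFencer(String rawTimeout) {
        this.rawTimeout = rawTimeout;
    }

    // Parsing happens only when fencing is attempted; an ill-formatted value
    // throws NumberFormatException at this point, long after startup.
    boolean tryFence() {
        int timeoutMs = Integer.parseInt(rawTimeout);
        return timeoutMs >= 0;
    }
}
```

With this pattern, new LazyFencer("30s") succeeds and the failure only shows up
when tryFence() is eventually called.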
============================
Parameters
-------------------------------------------------
In SSHFence, there are two configuration parameters defined in
SshFenceByTcpPort.java
"dfs.ha.fencing.ssh.connect-timeout";
"dfs.ha.fencing.ssh.private-key-files"
They are used in the tryFence() function for auto-fencing.
Any erroneous setting of these two parameters results in uncaught
exceptions that prevent the fencing and impair auto-failover. We have
verified this by setting up a two-NameNode auto-failover cluster and manually
killing the active NameNode. The passive NameNode could not take over
successfully.
For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include
ill-formatted integers and negative integers (the value is used for
Thread.join(), which rejects a negative timeout).
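A rough sketch of the kind of check I have in mind for the timeout is below.
FenceTimeoutCheck and parseTimeout are hypothetical names, not existing Hadoop
code, and the sketch takes the raw string directly instead of reading it from a
Hadoop Configuration object.

```java
/** Hypothetical helper for illustration; not part of Hadoop. */
final class FenceTimeoutCheck {
    static final String KEY = "dfs.ha.fencing.ssh.connect-timeout";

    private FenceTimeoutCheck() {}

    /**
     * Parses the raw setting and rejects ill-formatted or negative values
     * with a message that names the offending parameter.
     */
    static int parseTimeout(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException(KEY + " is not set");
        }
        final int timeoutMs;
        try {
            timeoutMs = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                KEY + " is not a valid integer: '" + raw + "'", e);
        }
        if (timeoutMs < 0) {
            throw new IllegalArgumentException(
                KEY + " must be non-negative, got " + timeoutMs);
        }
        return timeoutMs;
    }
}
```

Calling something like this at initialization would surface a bad value at
startup rather than during a failover.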
For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include
non-existent private-key file paths or wrong file permissions, either of which
makes jsch.addIdentity() fail in the createSession() method.
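The key-file setting could be checked up front in a similar way. The sketch
below is an assumption about what such a check might look like:
FenceKeyFileCheck and validateKeyFiles are hypothetical names, the value is
treated as a comma-separated list of paths, and only existence and readability
are tested (it does not verify that the file is a valid private key).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class FenceKeyFileCheck {
    static final String KEY = "dfs.ha.fencing.ssh.private-key-files";

    private FenceKeyFileCheck() {}

    /**
     * Splits the raw setting on commas and checks that every entry is an
     * existing, readable regular file. Returns the validated paths.
     */
    static List<Path> validateKeyFiles(String raw) {
        List<Path> keys = new ArrayList<>();
        if (raw == null) {
            return keys; // nothing configured, nothing to check
        }
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            Path path = Paths.get(trimmed);
            if (!Files.isRegularFile(path)) {
                throw new IllegalArgumentException(
                    KEY + ": not an existing file: " + trimmed);
            }
            if (!Files.isReadable(path)) {
                throw new IllegalArgumentException(
                    KEY + ": file is not readable: " + trimmed);
            }
            keys.add(path);
        }
        return keys;
    }
}
```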
I think actively checking the settings in the constructor of the class (in the
same way as YARN-2165, YARN-2166, HDFS-7726) should be able to fix the problems.
Thanks!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)