[
https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378208#comment-14378208
]
Zhe Zhang commented on HDFS-7727:
---------------------------------
Thanks Tianyin for the patch! I think it makes sense to move the argument sanity
checks earlier. A few questions:
# Assuming there's a bad key file and a few good ones, is the current ssh
fencing logic (without the patch) able to skip the bad file?
# Since the logic for handling these 2 config parameters is outside the
{{SshFenceByTcpPort}} class, I think it's worth testing the behavior (like the
above) without the patch first.
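For example, a quick check along these lines (just a sketch; the mocked target
and JUnit/Mockito scaffolding are assumed, not part of the patch) could answer
the first question:
{code}
// Sketch only: assumes a mocked HAServiceTarget; the target's address
// would need stubbing before an actual connection attempt is made.
@Test
public void testTryFenceWithOneBadKeyFile() throws Exception {
  Configuration conf = new Configuration();
  // A bad key path followed by a good one.
  conf.set("dfs.ha.fencing.ssh.private-key-files",
      "/tmp/does-not-exist,/home/hadoop/.ssh/id_rsa");
  SshFenceByTcpPort fence = new SshFenceByTcpPort();
  fence.setConf(conf);
  HAServiceTarget target = Mockito.mock(HAServiceTarget.class);
  // Without the patch: is the bad file skipped, or does fencing abort?
  fence.tryFence(target, null);
}
{code}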
> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
> Key: HDFS-7727
> URL: https://issues.apache.org/jira/browse/HDFS-7727
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: auto-failover
> Affects Versions: 2.4.1, 2.6.0, 2.5.1
> Reporter: Tianyin Xu
> Assignee: Tianyin Xu
> Attachments: check_config_SshFenceByTcpPort.1.patch
>
>
> ============================
> Problem
> -------------------------------------------------
> Currently, the auto-failover feature of HDFS only checks the settings of the
> parameter "dfs.ha.fencing.methods" but not the settings of the other
> "dfs.ha.fencing.*" parameters.
> Basically, the other "dfs.ha.fencing.*" settings are not checked or verified
> at initialization; they are parsed and applied directly at runtime, so any
> configuration error prevents auto-failover.
> Since these values are only used to handle failures (auto-failover), you won't
> notice the errors until the active NameNode fails and triggers the fence
> procedure in the auto-failover process.
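> For example, a single-character typo in hdfs-site.xml (hypothetical snippet;
> the bad path matches the log shown later) goes unnoticed until fencing
> actually runs:
> {code}
> <!-- Hypothetical hdfs-site.xml snippet: the typo ("id_rsax") is only
>      discovered when fencing is triggered during a failover. -->
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence</value>
> </property>
> <property>
>   <name>dfs.ha.fencing.ssh.private-key-files</name>
>   <value>/home/hadoop/.ssh/id_rsax</value>
> </property>
> {code}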
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, there are two configuration parameters defined in
> SshFenceByTcpPort.java:
> "dfs.ha.fencing.ssh.connect-timeout"
> "dfs.ha.fencing.ssh.private-key-files"
> They are used in the tryFence() function for auto-fencing.
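> For reference, the two values are read roughly as follows (paraphrased from
> SshFenceByTcpPort.java; exact details may vary across versions):
> {code}
> // Paraphrased accessors from SshFenceByTcpPort.java (may differ by version).
> private int getSshConnectTimeout() {
>   return getConf().getInt(
>       CONF_CONNECT_TIMEOUT_KEY, CONF_CONNECT_TIMEOUT_DEFAULT);
> }
>
> private Collection<String> getKeyFiles() {
>   return getConf().getTrimmedStringCollection(CONF_IDENTITIES_KEY);
> }
> {code}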
> Any erroneous settings of these two parameters result in uncaught exceptions
> that prevent the fencing and auto-failover. We have verified this by setting
> up a two-NameNode auto-failover cluster and manually killing the active
> NameNode; the standby NameNode cannot take over.
> For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include
> ill-formatted integers and negative integers for
> dfs.ha.fencing.ssh.connect-timeout (it is used for Thread.join()).
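> To illustrate (a minimal sketch, not from the patch): Thread.join(long)
> rejects negative timeouts, so a negative connect-timeout surfaces as an
> uncaught IllegalArgumentException only at fencing time:
> {code}
> // Minimal sketch: Thread.join(long) throws IllegalArgumentException
> // ("timeout value is negative") for any negative millis value.
> Thread t = new Thread();
> t.start();
> t.join(-1);  // java.lang.IllegalArgumentException
> {code}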
> For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include
> non-existent private-key file path or wrong permissions that fail
> jsch.addIdentity() in the createSession() method.
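> Roughly, createSession() loads every configured key up front (paraphrased;
> exact code may differ by version), so a single bad entry aborts the whole
> session setup:
> {code}
> // Paraphrased from SshFenceByTcpPort.createSession(): one bad key path
> // makes jsch.addIdentity() throw, so no SSH session is ever created.
> JSch jsch = new JSch();
> for (String keyFile : getKeyFiles()) {
>   jsch.addIdentity(keyFile);
> }
> Session session = jsch.getSession(args.user, host, args.sshPort);
> {code}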
> The following gives one example of the failure caused by misconfiguring the
> "dfs.ha.fencing.ssh.private-key-files" parameter.
> {code}
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
> com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
>     at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
>     at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
>     at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
>     at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
>     at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
>     at java.io.FileInputStream.open(Native Method)
>     at java.io.FileInputStream.<init>(FileInputStream.java:146)
>     at java.io.FileInputStream.<init>(FileInputStream.java:101)
>     at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
>     ... 14 more
> {code}
> ============================
> Solution (the patch)
> -------------------------------------------------
> Check the configuration settings in the checkArgs() function. Currently,
> checkArgs() only checks the settings of the parameter "dfs.ha.fencing.methods"
> but not the settings of the other "dfs.ha.fencing.*" parameters.
> {code:title=SshFenceByTcpPort.java|borderStyle=solid}
> /**
>  * Verify that the argument, if given, in the conf is parseable.
>  */
> @Override
> public void checkArgs(String argStr) throws BadFencingConfigurationException {
>   if (argStr != null) {
>     new Args(argStr);
>   }
>   // <= Insert the checkers here (see the patch attached)
> }
> {code}
> The detailed patch is shown below.
> {code}
> @@ -76,6 +77,23 @@
>      if (argStr != null) {
>        new Args(argStr);
>      }
> +
> +    // The configuration could be empty (e.g., called from DFSHAAdmin)
> +    if (getConf().size() > 0) {
> +      // check dfs.ha.fencing.ssh.connect-timeout
> +      if (getSshConnectTimeout() <= 0)
> +        throw new BadFencingConfigurationException(
> +            CONF_CONNECT_TIMEOUT_KEY +
> +            " property value should be positive and non-zero");
> +
> +      // check the settings of dfs.ha.fencing.ssh.private-key-files
> +      for (String keyFilePath : getKeyFiles()) {
> +        File keyFile = new File(keyFilePath);
> +        if (!keyFile.isFile() || !keyFile.canRead())
> +          throw new BadFencingConfigurationException(
> +              "The configured private key file is invalid: " + keyFilePath);
> +      }
> +    }
>    }
>  
>    @Override
> {code}
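> With this check in place, an invalid value should be rejected when checkArgs()
> is invoked while the fencing configuration is parsed (e.g., at ZKFC startup),
> instead of surfacing as an uncaught exception in the middle of a live
> failover.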
> Thanks!