[ https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tianyin Xu updated HDFS-7727:
-----------------------------
    Attachment: check_config_SshFenceByTcpPort.1.patch

> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
> ============================
> Problem
> -------------------------------------------------
> Currently, the auto-failover feature of HDFS only checks the settings of the
> parameter "dfs.ha.fencing.methods" but not the settings of the other
> "dfs.ha.fencing.*" parameters.
> These other "dfs.ha.fencing.*" settings are not checked or verified at
> initialization; they are parsed and applied directly at runtime, so any
> configuration error prevents auto-failover.
> Because these values are only used to deal with failures (auto-failover),
> the errors go unnoticed until the active NameNode fails and triggers the
> fencing procedure in the auto-failover process.
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, there are two configuration parameters defined in
> SshFenceByTcpPort.java:
> "dfs.ha.fencing.ssh.connect-timeout"
> "dfs.ha.fencing.ssh.private-key-files"
> They are used in the tryFence() function for auto-fencing.
> Any erroneous setting of these two parameters results in uncaught
> exceptions that prevent fencing and auto-failover. We verified this by
> setting up a two-NameNode auto-failover cluster and manually killing the
> active NameNode; the standby NameNode could not take over.
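For reference, a working configuration of the two parameters might look like the following hdfs-site.xml fragment. The key path and the timeout value are illustrative only; the point is that the timeout must parse as a positive integer and the key file must exist and be readable by the ZKFC process.

{code:title=hdfs-site.xml|borderStyle=solid}
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <!-- milliseconds; must be a well-formed, positive integer -->
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
<property>
  <!-- must point to an existing, readable private-key file -->
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
{code}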
> For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include > ill-formatted integers and negative integers for > dfs.ha.fencing.ssh.connect-timeout (it is used for Thread.join()). > For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include > non-existent private-key file path or wrong permissions that fail > jsch.addIdentity() in the createSession() method. > The following gives one example of the failure casued by misconfiguring the > "dfs.ha.fencing.ssh.private-key-files" parameter. > {code} > 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== > Beginning Service Fencing Process... ====== > 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method > 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null) > 2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable > to create SSH session > com.jcraft.jsch.JSchException: java.io.FileNotFoundException: > /home/hadoop/.ssh/id_rsax (No such file or directory) > at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98) > at com.jcraft.jsch.JSch.addIdentity(JSch.java:206) > at com.jcraft.jsch.JSch.addIdentity(JSch.java:192) > at > org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122) > at > org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91) > at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97) > at > org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521) > at > org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494) > at > org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59) > at > org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837) > at > org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800) > at > 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such > file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.<init>(FileInputStream.java:146) > at java.io.FileInputStream.<init>(FileInputStream.java:101) > at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83) > ... 14 more > {code} > ============================ > Solution (the patch) > ------------------------------------------------- > Check the configuration settings in the checkArgs() function. Currently, > checkArg() only checks the settings of the parameter "dfs.ha.fencing.methods" > but not the settings of the other "dfs.ha.fencing.*" parameters. > {code:title=SshFenceByTcpPort.java|borderStyle=solid} > /** > * Verify that the argument, if given, in the conf is parseable. > */ > @Override > public void checkArgs(String argStr) throws > BadFencingConfigurationException { > if (argStr != null) { > new Args(argStr); > } > <= Insert the checkers here (see the patch attached) > } > {code} > The detailed patch is shown below. 
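As a standalone illustration of the connect-timeout failure mode: Thread.join(long) throws an uncaught IllegalArgumentException when given a negative timeout, which is exactly what a negative "dfs.ha.fencing.ssh.connect-timeout" would feed it. The class and method names below are hypothetical, written only for this demonstration; they are not part of Hadoop.

```java
// Hypothetical demo class (not Hadoop code): shows that a negative
// timeout is rejected by Thread.join() with an IllegalArgumentException,
// mirroring what would happen inside tryFence() at runtime.
public class NegativeTimeoutDemo {

    // Returns true if Thread.join(millis) rejected the timeout value.
    static boolean joinRejects(long millis) {
        Thread t = new Thread(() -> { });
        t.start();
        try {
            t.join(millis);
            return false; // join completed normally
        } catch (IllegalArgumentException e) {
            return true;  // "timeout value is negative"
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(joinRejects(-1));   // negative timeout is rejected
        System.out.println(joinRejects(1000)); // positive timeout joins fine
    }
}
```

This is why the patch rejects non-positive values at checkArgs() time instead of letting the exception surface during a real failover.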
> {code}
> @@ -76,6 +77,23 @@
>      if (argStr != null) {
>        new Args(argStr);
>      }
> +
> +    // The configuration could be empty (e.g., called from DFSHAAdmin)
> +    if (getConf().size() > 0) {
> +      // check dfs.ha.fencing.ssh.connect-timeout
> +      if (getSshConnectTimeout() <= 0)
> +        throw new BadFencingConfigurationException(
> +            CONF_CONNECT_TIMEOUT_KEY +
> +            " property value should be positive and non-zero");
> +
> +      // check dfs.ha.fencing.ssh.private-key-files
> +      for (String keyFilePath : getKeyFiles()) {
> +        File keyFile = new File(keyFilePath);
> +        if (!keyFile.isFile() || !keyFile.canRead())
> +          throw new BadFencingConfigurationException(
> +              "The configured private key file is invalid: " + keyFilePath);
> +      }
> +    }
>   }
> 
>   @Override
> {code}
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)