[ https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tianyin Xu updated HDFS-7727:
-----------------------------
    Description: 

============================
Problem
-------------------------------------------------
Currently, the auto-failover feature of HDFS only checks the settings of the parameter "dfs.ha.fencing.methods" but not the settings of the other "dfs.ha.fencing.*" parameters. The settings of these other parameters are not checked or verified at initialization; they are parsed and applied directly at runtime, so any configuration error in them silently breaks auto-failover.

Because these values are only used to deal with failures, you will not notice the errors until the active NameNode fails and triggers the fencing procedure in the auto-failover process.

============================
Parameters
-------------------------------------------------
In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java:

"dfs.ha.fencing.ssh.connect-timeout"
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing. Any erroneous setting of these two parameters results in uncaught exceptions that prevent fencing and auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode: the passive NameNode cannot take over.

For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include ill-formatted integers and negative integers (the value is passed to Thread.join()).

For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a non-existent private-key file path or wrong file permissions, either of which makes jsch.addIdentity() fail in the createSession() method.

The following shows one example of the failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files" parameter.

{code}
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... 
====== 
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
	at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
	at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
	at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
	at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
	at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
	at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
	... 
14 more
{code}

============================
Solution (the patch)
-------------------------------------------------
Check the configuration settings in the checkArgs() function. Currently, checkArgs() only checks the setting of the "dfs.ha.fencing.methods" parameter but not the other "dfs.ha.fencing.*" parameters.

{code:title=SshFenceByTcpPort.java|borderStyle=solid}
  /**
   * Verify that the argument, if given, in the conf is parseable.
   */
  @Override
  public void checkArgs(String argStr) throws BadFencingConfigurationException {
    if (argStr != null) {
      new Args(argStr);
    }
    // <= Insert the checkers here (see the patch attached)
  }
{code}

The detailed patch is shown below.

{code}
--- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/SshFenceByTcpPort.java	2014-06-20 22:40:07.000000000 -0700
+++ SshFenceByTcpPort.java	2015-02-03 01:06:47.125577795 -0800
@@ -17,6 +17,7 @@
  */
 package org.apache.hadoop.ha;
 
+import java.io.File;
 import java.io.IOException;
 import java.net.InetSocketAddress;
 import java.util.Collection;
@@ -76,6 +77,23 @@
     if (argStr != null) {
       new Args(argStr);
     }
+
+    // The configuration could be empty (e.g., called from DFSHAAdmin)
+    if (getConf().size() > 0) {
+      // check dfs.ha.fencing.ssh.connect-timeout
+      if (getSshConnectTimeout() <= 0)
+        throw new BadFencingConfigurationException(
+            CONF_CONNECT_TIMEOUT_KEY +
+            " property value should be positive and non-zero");
+
+      // check the settings of dfs.ha.fencing.ssh.private-key-files
+      for (String keyFilePath : getKeyFiles()) {
+        File keyFile = new File(keyFilePath);
+        if (!keyFile.isFile() || !keyFile.canRead())
+          throw new BadFencingConfigurationException(
+              "The configured private key file is invalid: " + keyFilePath);
+      }
+    }
   }
 
   @Override
{code}

Thanks!
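As noted above, the connect-timeout value is handed to Thread.join(), which rejects negative timeouts with an IllegalArgumentException at call time. The sketch below (illustrative only, not HDFS code; the class name and helper method are made up for the demo) shows why a misconfigured negative "dfs.ha.fencing.ssh.connect-timeout" only surfaces when fencing actually runs:

```java
// Sketch of why a negative connect-timeout fails only at fence time:
// Thread.join(millis) throws IllegalArgumentException for negative values,
// so a bad value goes unnoticed until tryFence() actually calls it.
public class ConnectTimeoutDemo {

    /** Returns true iff Thread.join() accepts this timeout value. */
    static boolean joinAccepts(long timeoutMillis) throws InterruptedException {
        Thread t = new Thread(() -> { /* no-op worker */ });
        t.start();
        try {
            t.join(timeoutMillis);
            return true;
        } catch (IllegalArgumentException e) {
            // This is the kind of uncaught exception that would abort
            // fencing if the configured timeout were negative.
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(joinAccepts(30000)); // a well-formed timeout -> true
        System.out.println(joinAccepts(-5000)); // a misconfigured negative value -> false
    }
}
```

Validating the value up front in checkArgs(), as the patch does, moves this failure from fence time to configuration-check time.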
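The private-key-file check the patch adds can be sketched standalone as follows (illustrative class and method names; the real check lives in SshFenceByTcpPort.checkArgs()): a configured path is usable only if it names an existing, readable regular file.

```java
import java.io.File;

// Standalone sketch of the private-key-file validity check:
// a path is acceptable only if it is a readable regular file,
// mirroring what jsch.addIdentity() will require later.
public class KeyFileCheck {

    static boolean isUsableKeyFile(String keyFilePath) {
        File keyFile = new File(keyFilePath);
        return keyFile.isFile() && keyFile.canRead();
    }

    public static void main(String[] args) {
        // A misspelled path like the id_rsax in the log above would be
        // rejected here at configuration-check time, not during fencing.
        System.out.println(isUsableKeyFile("/definitely/not/a/real/id_rsa"));
    }
}
```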
> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)