[ https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tianyin Xu updated HDFS-7727:
-----------------------------
Description: 
============================
Problem
-------------------------------------------------
Currently, the auto-failover feature of HDFS only checks the settings of the parameter "dfs.ha.fencing.methods" but not the settings of the other "dfs.ha.fencing.*" parameters.

Basically, the configuration settings of the other "dfs.ha.fencing.*" parameters are not checked and verified at initialization but directly parsed and applied at runtime. Any configuration errors would prevent the auto-failover. Since the values are used to deal with failures (auto-failover), you won't notice the errors until the active NameNode fails and triggers the fence procedure in the auto-failover process.

============================
Parameters
-------------------------------------------------
In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java:

"dfs.ha.fencing.ssh.connect-timeout"
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing. Any erroneous settings of these two parameters result in uncaught exceptions that prevent the fencing and the auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode; the passive NameNode cannot take over.

For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include ill-formatted integers and negative integers (the value is passed to Thread.join()).

For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include non-existent private-key file paths or wrong permissions, which fail jsch.addIdentity() in the createSession() method.

The following gives one example of the failure caused by misconfiguring the "dfs.ha.fencing.ssh.private-key-files" parameter.

{code}
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... 
======
2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to create SSH session
com.jcraft.jsch.JSchException: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
        at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
        at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
        at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
        at org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
        at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
        at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at java.io.FileInputStream.<init>(FileInputStream.java:101)
        at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
        ...
14 more
{code}

============================
Solution (the patch)
-------------------------------------------------
Check the configuration settings in the checkArgs() function. Currently, checkArgs() only checks the settings of the parameter "dfs.ha.fencing.methods" but not the settings of the other "dfs.ha.fencing.*" parameters.

{code:title=SshFenceByTcpPort.java|borderStyle=solid}
  /**
   * Verify that the argument, if given, in the conf is parseable.
   */
  @Override
  public void checkArgs(String argStr) throws BadFencingConfigurationException {
    if (argStr != null) {
      new Args(argStr);
    }
    <= Insert the checkers here (see the patch attached)
  }
{code}

Thanks!

was:
Sorry for reporting similar problems, but the problems reside in different components, and this one has a more severe consequence (well, this is my last report of this type of problem).

============================
Problem
-------------------------------------------------
The problem is similar to the following issues resolved in YARN:
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166
and reported (by me) in the HDFS EditLogTailer:
https://issues.apache.org/jira/browse/HDFS-7726

Basically, the configuration settings are not checked and verified at initialization but directly parsed and applied at runtime. Any configuration errors would impair the corresponding components (since the exceptions are not caught). In this case, the values are used in auto-failover, so you won't notice the errors until one of the NameNodes fails and triggers the fence procedure in the auto-failover process.

============================
Parameters
-------------------------------------------------
In SSHFence, there are two configuration parameters defined in SshFenceByTcpPort.java:

"dfs.ha.fencing.ssh.connect-timeout"
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing.
Any erroneous settings of these two parameters would result in uncaught exceptions that would prevent the fencing and impair the auto-failover. We verified this by setting up a two-NameNode auto-failover cluster and manually killing the active NameNode; the passive NameNode cannot take over successfully.

For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include ill-formatted integers and negative integers (the value is passed to Thread.join()).

For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include non-existent private-key file paths or wrong permissions, which fail jsch.addIdentity() in the createSession() method.

I think actively checking the settings in the constructor of the class (in the same way as YARN-2165, YARN-2166, and HDFS-7726) should fix the problems.

Thanks!


> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
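To make the proposed fix concrete, below is a standalone sketch of the kind of checks the patch inserts into checkArgs(). The class name, method names, and use of IllegalArgumentException are stand-ins chosen for illustration; the attached patch works against Hadoop's BadFencingConfigurationException and the real configuration keys, so treat this as an approximation of the idea, not the patch itself.

```java
import java.io.File;

/**
 * Sketch of startup-time validation for the two "dfs.ha.fencing.ssh.*"
 * settings, mirroring the failure modes described above: an unparseable or
 * negative connect-timeout, and a missing or unreadable private-key file.
 */
public class FencingConfigCheck {

    /** dfs.ha.fencing.ssh.connect-timeout must parse as a non-negative integer
     *  (the value is later handed to Thread.join(), which rejects negatives). */
    static void checkConnectTimeout(String value) {
        long timeout;
        try {
            timeout = Long.parseLong(value);   // ill-formatted values fail here
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.connect-timeout is not a valid integer: " + value);
        }
        if (timeout < 0) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.connect-timeout must be >= 0, got " + timeout);
        }
    }

    /** dfs.ha.fencing.ssh.private-key-files: each path must exist and be
     *  readable, otherwise jsch.addIdentity() fails at fence time. */
    static void checkPrivateKeyFiles(String[] paths) {
        for (String path : paths) {
            File f = new File(path);
            if (!f.isFile() || !f.canRead()) {
                throw new IllegalArgumentException(
                    "private key file missing or unreadable: " + path);
            }
        }
    }

    public static void main(String[] args) {
        checkConnectTimeout("30000");          // a sane value passes
        try {
            checkConnectTimeout("30s");        // caught at startup, not at fence time
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

With checks like these run from checkArgs(), a bad setting surfaces when the ZKFC starts instead of silently waiting to break the fence procedure during an actual failover.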