[ 
https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7727:
-----------------------------
    Attachment: check_config_SshFenceByTcpPort.1.patch

> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
>
> ============================
> Problem
> -------------------------------------------------
> Currently, the auto-failover feature of HDFS only checks the settings of the 
> parameter "dfs.ha.fencing.methods" but not the settings of the other 
> "dfs.ha.fencing.*" parameters.
> In other words, the settings of the other "dfs.ha.fencing.*" parameters are 
> not checked or verified at initialization; they are parsed and applied 
> directly at runtime, so any configuration error prevents auto-failover.
> Because these values are only used when handling failures, the errors go 
> unnoticed until the active NameNode fails and the fencing procedure of the 
> auto-failover process is triggered.
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, two configuration parameters are defined in 
> SshFenceByTcpPort.java:
> "dfs.ha.fencing.ssh.connect-timeout";
> "dfs.ha.fencing.ssh.private-key-files"
> Both are used in the tryFence() function for auto-fencing. 
> Any erroneous setting of these two parameters results in uncaught exceptions 
> that prevent fencing and auto-failover. We verified this by setting up a 
> two-NameNode auto-failover cluster and manually killing the active NameNode: 
> the standby NameNode cannot take over. 
> For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include 
> ill-formatted integers and negative integers (the value is passed to 
> Thread.join()).
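> These two failure modes can be illustrated in isolation with plain Java (a 
> hedged sketch; FenceTimeoutDemo and its methods are hypothetical names, not 
> part of Hadoop -- the real code path goes through Configuration.getInt() and 
> Thread.join()):
> {code}
> // Hypothetical demo of the connect-timeout failure modes described above.
> public class FenceTimeoutDemo {
>   // Mirrors integer parsing of the configured value:
>   // an ill-formatted string throws NumberFormatException.
>   public static int parseTimeout(String raw) {
>     return Integer.parseInt(raw);
>   }
> 
>   // A negative timeout is only rejected later, inside Thread.join(),
>   // which throws IllegalArgumentException for negative millis.
>   public static boolean joinWithTimeout(Thread t, long millis)
>       throws InterruptedException {
>     try {
>       t.join(millis);
>       return true;
>     } catch (IllegalArgumentException e) {
>       return false;
>     }
>   }
> 
>   public static void main(String[] args) throws InterruptedException {
>     try {
>       parseTimeout("30s");  // ill-formatted: not a plain integer
>     } catch (NumberFormatException e) {
>       System.out.println("ill-formatted timeout rejected");
>     }
>     Thread idle = new Thread(() -> {});
>     System.out.println("negative timeout accepted? "
>         + joinWithTimeout(idle, -1));
>   }
> }
> {code}
> Neither exception is caught in the fencing path, which is why a bad value 
> surfaces only when fencing is attempted.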
> For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a 
> non-existent private-key file path or wrong file permissions, either of which 
> makes jsch.addIdentity() fail in the createSession() method.
> The following shows one example of the failure caused by misconfiguring the 
> "dfs.ha.fencing.ssh.private-key-files" parameter.
> {code}
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: ====== 
> Beginning Service Fencing Process... ======
> 2015-02-02 23:38:32,960 INFO org.apache.hadoop.ha.NodeFencer: Trying method 
> 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2015-02-02 23:38:32,960 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable 
> to create SSH session
> com.jcraft.jsch.JSchException: java.io.FileNotFoundException: 
> /home/hadoop/.ssh/id_rsax (No such file or directory)
>         at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:98)
>         at com.jcraft.jsch.JSch.addIdentity(JSch.java:206)
>         at com.jcraft.jsch.JSch.addIdentity(JSch.java:192)
>         at 
> org.apache.hadoop.ha.SshFenceByTcpPort.createSession(SshFenceByTcpPort.java:122)
>         at 
> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:91)
>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>         at 
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> Caused by: java.io.FileNotFoundException: /home/hadoop/.ssh/id_rsax (No such 
> file or directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:146)
>         at java.io.FileInputStream.<init>(FileInputStream.java:101)
>         at com.jcraft.jsch.IdentityFile.newInstance(IdentityFile.java:83)
>         ... 14 more
> {code}
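> A cheap pre-flight check would catch such a misconfiguration before the path 
> ever reaches jsch.addIdentity(). The sketch below is illustrative only 
> (KeyFileCheckDemo is a hypothetical name; the actual fix is the attached 
> patch):
> {code}
> import java.io.File;
> 
> // Hypothetical sketch of a pre-flight private-key-file check.
> public class KeyFileCheckDemo {
>   // Reject missing paths, directories, and unreadable files up front,
>   // instead of hitting FileNotFoundException during fencing.
>   public static boolean isUsableKeyFile(String path) {
>     File f = new File(path);
>     return f.isFile() && f.canRead();
>   }
> 
>   public static void main(String[] args) {
>     // With the log above, this path does not exist, so the check fails.
>     System.out.println(isUsableKeyFile("/home/hadoop/.ssh/id_rsax"));
>   }
> }
> {code}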
> ============================
> Solution (the patch)
> -------------------------------------------------
> Check the configuration settings in the checkArgs() function. Currently, 
> checkArgs() only checks the setting of the parameter "dfs.ha.fencing.methods" 
> but not the settings of the other "dfs.ha.fencing.*" parameters.
> {code:title=SshFenceByTcpPort.java|borderStyle=solid}
>   /**
>    * Verify that the argument, if given, in the conf is parseable.
>    */
>   @Override
>   public void checkArgs(String argStr) throws BadFencingConfigurationException {
>     if (argStr != null) {
>       new Args(argStr);
>     }
>     // <= insert the checks here (see the attached patch)
>   }
> {code}
> The detailed patch is shown below.
> {code}
> @@ -76,6 +77,23 @@
>      if (argStr != null) {
>        new Args(argStr);
>      }
> +
> +    //The configuration could be empty (e.g., called from DFSHAAdmin)
> +    if(getConf().size() > 0) {
> +      //check ssh.connect-timeout
> +      if(getSshConnectTimeout() <= 0)
> +        throw new BadFencingConfigurationException(
> +            CONF_CONNECT_TIMEOUT_KEY +
> +            " property value must be a positive integer");
> +
> +      //check the settings of dfs.ha.fencing.ssh.private-key-files
> +      for (String keyFilePath : getKeyFiles()) {
> +        File keyFile = new File(keyFilePath);
> +        if(!keyFile.isFile() || !keyFile.canRead())
> +            throw new BadFencingConfigurationException(
> +                "The configured private key file is invalid: " + keyFilePath);
> +      }
> +    }
>    }
>  
>    @Override
> {code}
> Thanks! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
