[ 
https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7727:
-----------------------------
    Attachment: check_config_SshFenceByTcpPort.1.patch

> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
>
> Sorry for reporting similar problems, but the problems reside in different 
> components, and this one has more severe consequences (this is my last 
> report of this type of problem). 
> ============================
> Problem
> -------------------------------------------------
> The problem is similar to the following issues resolved in YARN,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> and reported (by me) in HDFS EditLogTailer,
> https://issues.apache.org/jira/browse/HDFS-7726
> Basically, the configuration settings are not checked and verified at 
> initialization but are parsed and applied directly at runtime. Any 
> configuration error would impair the corresponding component (since the 
> exceptions are not caught). 
> In this case, the values are used in auto-failover, so you won't notice the 
> errors until one of the NameNodes fails and triggers the fence procedure in 
> the auto-failover process.
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, there are two configuration parameters defined in 
> SshFenceByTcpPort.java
> "dfs.ha.fencing.ssh.connect-timeout";
> "dfs.ha.fencing.ssh.private-key-files"
> They are used in the tryFence() function for auto-fencing. 
> Any erroneous setting of these two parameters results in uncaught 
> exceptions that prevent the fencing and impair auto-failover. We have 
> verified this by setting up a two-NameNode auto-failover cluster and manually 
> killing the active NameNode. The standby NameNode cannot take over successfully. 
> For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include 
> ill-formatted integers and negative integers (the value is used for 
> Thread.join()).
> For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include 
> nonexistent private-key file paths or wrong permissions, either of which 
> makes jsch.addIdentity() fail in the createSession() method.
> I think actively checking the settings in the constructor of the class (in 
> the same way as YARN-2165, YARN-2166, HDFS-7726) should be able to fix the 
> problems.
> Thanks! 
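
To illustrate the kind of eager check proposed in the description above, here is a minimal sketch in plain Java. It is not the attached patch and uses no Hadoop classes: the class name FenceConfigCheck and both method names are hypothetical, and it assumes the private-key setting is a comma-separated list of paths. The idea is only that each value is validated once, at construction time, so that a bad setting fails fast instead of surfacing during a failover.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or the patch.
final class FenceConfigCheck {
    private FenceConfigCheck() {}

    // Parses the raw value of dfs.ha.fencing.ssh.connect-timeout and rejects
    // ill-formatted or negative values, both of which would otherwise only
    // fail later when the value reaches Thread.join().
    static int parseConnectTimeout(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.connect-timeout is not set");
        }
        final int timeoutMs;
        try {
            timeoutMs = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.connect-timeout is not an integer: " + raw, e);
        }
        if (timeoutMs < 0) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.connect-timeout must not be negative: " + timeoutMs);
        }
        return timeoutMs;
    }

    // Checks every entry of dfs.ha.fencing.ssh.private-key-files (assumed to be
    // comma-separated) and rejects paths that are missing or unreadable.
    static List<Path> checkPrivateKeyFiles(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException(
                "dfs.ha.fencing.ssh.private-key-files is not set");
        }
        List<Path> keys = new ArrayList<>();
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            Path key = Paths.get(trimmed);
            if (!Files.isRegularFile(key) || !Files.isReadable(key)) {
                throw new IllegalArgumentException(
                    "Private key file is missing or unreadable: " + trimmed);
            }
            keys.add(key);
        }
        return keys;
    }
}
```

Calling both methods from the constructor of the fencing class would surface a misconfiguration at startup with a message naming the offending key. The readability check covers only the permissions of the current process; it does not prove that the key is accepted by the remote host.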



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
