[ https://issues.apache.org/jira/browse/HDFS-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tianyin Xu updated HDFS-7727:
-----------------------------
    Attachment: check_config_SshFenceByTcpPort.1.patch

> Check and verify the auto-fence settings to prevent failures of auto-failover
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7727
>                 URL: https://issues.apache.org/jira/browse/HDFS-7727
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.4.1, 2.6.0, 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: check_config_SshFenceByTcpPort.1.patch
>
>
> Sorry for reporting similar problems, but the problems reside in different
> components, and this one has more severe consequences (this is my last
> report of this type of problem).
> ============================
> Problem
> -------------------------------------------------
> The problem is similar to the following issues resolved in YARN,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> and reported (by me) in HDFS EditLogTailer,
> https://issues.apache.org/jira/browse/HDFS-7726
> Basically, the configuration settings are not checked and verified at
> initialization but are parsed and applied directly at runtime. Any
> configuration error would impair the corresponding component (since the
> exceptions are not caught).
> In this case, the values are used in auto-failover, so you won't notice the
> errors until one of the NameNodes fails and triggers the fence procedure in
> the auto-failover process.
> ============================
> Parameters
> -------------------------------------------------
> In SSHFence, there are two configuration parameters defined in
> SshFenceByTcpPort.java:
> "dfs.ha.fencing.ssh.connect-timeout"
> "dfs.ha.fencing.ssh.private-key-files"
> They are used in the tryFence() function for auto-fencing.
> Any erroneous setting of these two parameters would result in uncaught
> exceptions that would prevent the fencing and impair auto-failover.
> We have verified this by setting up a two-NameNode auto-failover cluster and
> manually killing the active NameNode. The passive NameNode cannot take over
> successfully.
> For "dfs.ha.fencing.ssh.connect-timeout", the erroneous settings include
> ill-formatted integers and negative integers (the value is used for
> Thread.join()).
> For "dfs.ha.fencing.ssh.private-key-files", the erroneous settings include
> a non-existent private-key file path or wrong file permissions, either of
> which makes jsch.addIdentity() fail in the createSession() method.
> I think actively checking the settings in the constructor of the class (in
> the same way as YARN-2165, YARN-2166, HDFS-7726) should be able to fix the
> problems.
> Thanks!
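As a rough illustration of the constructor-time check suggested in the description, the sketch below validates the two settings up front. It is not the attached patch: the class FenceConfigCheck, its method names, the caller-supplied default timeout, and the use of a plain Map in place of Hadoop's Configuration object are all assumptions made so the example stays self-contained. It also assumes the private-key setting is a comma-separated list of paths.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: checks the two SSH fencing
// settings eagerly so a bad value fails at startup, not during failover.
final class FenceConfigCheck {
    static final String TIMEOUT_KEY = "dfs.ha.fencing.ssh.connect-timeout";
    static final String KEY_FILES_KEY = "dfs.ha.fencing.ssh.private-key-files";

    /** Parses the timeout; rejects ill-formatted and negative values. */
    static int validateTimeout(String raw, int defaultMs) {
        if (raw == null || raw.trim().isEmpty()) {
            return defaultMs; // unset: fall back to the caller's default
        }
        int ms;
        try {
            ms = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                TIMEOUT_KEY + " is not an integer: '" + raw + "'", e);
        }
        if (ms < 0) {
            // Thread.join(long) throws on a negative timeout.
            throw new IllegalArgumentException(
                TIMEOUT_KEY + " must not be negative: " + ms);
        }
        return ms;
    }

    /** Splits the (assumed) comma-separated list; rejects unreadable files. */
    static List<File> validateKeyFiles(String raw) {
        List<File> files = new ArrayList<>();
        if (raw == null) {
            return files; // unset: no explicit key files configured
        }
        for (String part : raw.split(",")) {
            String path = part.trim();
            if (path.isEmpty()) {
                continue;
            }
            File f = new File(path);
            if (!f.isFile() || !f.canRead()) {
                throw new IllegalArgumentException(
                    KEY_FILES_KEY + ": not a readable file: " + path);
            }
            files.add(f);
        }
        return files;
    }

    /** Runs both checks; the Map stands in for the real configuration. */
    static void validate(Map<String, String> conf, int defaultTimeoutMs) {
        validateTimeout(conf.get(TIMEOUT_KEY), defaultTimeoutMs);
        validateKeyFiles(conf.get(KEY_FILES_KEY));
    }
}
```

Calling something like FenceConfigCheck.validate(...) when the fencer is constructed would surface a malformed timeout or a missing key file as an IllegalArgumentException at initialization, instead of as an uncaught exception inside tryFence() when a failover is already in progress. The readability check only approximates the permission problems mentioned above; JSch may still reject a key for other reasons.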