Chen Zhang created HDFS-14652:
---------------------------------

             Summary: HealthMonitor connection retry times should be 
configurable
                 Key: HDFS-14652
                 URL: https://issues.apache.org/jira/browse/HDFS-14652
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Chen Zhang


On our production HDFS cluster, burst requests from some clients filled the 
kernel TCP queue on the NameNode's host. Because "net.ipv4.tcp_syn_retries" is 
set to 1 in our environment, after 3 seconds the ZKFC HealthMonitor got a 
connection error like this:
{code:java}
WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to 
monitor health of NameNode at nn_host_name/ip_address:port: Call From 
zkfc_host_name/ip to nn_host_name:port failed on connection exception: 
java.net.ConnectException: Connection timed out; For more details see: 
http://wiki.apache.org/hadoop/ConnectionRefused
{code}
This error caused a failover and affected the availability of that cluster. We 
fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries to 6.
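
The 3-second figure can be checked with a little arithmetic. The sketch below 
assumes the usual Linux behavior of a 1-second initial SYN retransmission 
timeout that doubles after every retry; the class and method names are 
made up for illustration and are not part of Hadoop.

```java
// Sketch: how long connect() waits before failing with "Connection timed out"
// when no SYN is ever answered.
// Assumption: initial retransmission timeout of 1 second, doubling per retry.
final class SynTimeoutEstimate {

    /** Seconds until connect() gives up for a given net.ipv4.tcp_syn_retries. */
    static long connectTimeoutSeconds(int synRetries) {
        long total = 0;
        long rto = 1; // assumed initial retransmission timeout, in seconds
        // One wait after the initial SYN, plus one wait after each retransmission.
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto;
            rto *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutSeconds(1)); // 1 + 2 = 3
        System.out.println(connectTimeoutSeconds(6)); // 1 + 2 + ... + 64 = 127
    }
}
```

Under that assumption, a value of 1 gives up after 3 seconds, while a value of 
6 keeps trying for roughly two minutes, which is long enough to ride out a 
short burst.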

While working on this issue, however, we found that the connection retry 
count (ipc.client.connect.max.retries) of the health monitor is hard-coded to 
1. I think it should be configurable, so that if we don't want the health 
monitor to be so sensitive, we can change its behavior through this 
configuration.
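
A minimal sketch of what such a lookup could look like, using 
java.util.Properties as a stand-in for the Hadoop Configuration object so the 
example runs on its own. The key name and the class are hypothetical and only 
illustrate the idea; the default of 1 mirrors the current hard-coded value.

```java
import java.util.Properties;

// Sketch only: Properties stands in for Hadoop's Configuration here.
final class HealthMonitorRetryConfig {

    // Hypothetical key name, chosen for illustration.
    static final String MAX_RETRIES_KEY =
        "ha.health-monitor.rpc.connect.max.retries";

    // Default keeps today's behavior (the hard-coded value of 1).
    static final int MAX_RETRIES_DEFAULT = 1;

    /** Returns the configured retry count, or the default when unset. */
    static int maxRetries(Properties conf) {
        String raw = conf.getProperty(MAX_RETRIES_KEY);
        if (raw == null) {
            return MAX_RETRIES_DEFAULT;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                MAX_RETRIES_KEY + " must not be negative: " + value);
        }
        return value;
    }
}
```

With nothing set, the health monitor would behave as it does now; an operator 
who wants it less sensitive could raise the value in the site configuration.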


