[ 
https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896730#comment-16896730
 ] 

Chen Zhang commented on HDFS-14652:
-----------------------------------

Thanks [~jojochuang]. I also don't know why these machines were initialized 
with net.ipv4.tcp_syn_retries=1. Our company has hundreds of production 
services with different requirements, so it may simply be a mistake made by 
our DevOps team, but it is definitely not what we want. We've set this 
parameter to 6 on all Hadoop machines.
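
For reference, the 3-second failure mentioned in the issue description is consistent with how the kernel backs off SYN retransmissions. The sketch below is only an illustration: it assumes the usual Linux behavior of a 1-second initial retransmission timeout that doubles after each unanswered SYN and is capped at 120 seconds. The class and method names are made up for this example.

```java
public class SynTimeout {
    // Assumptions (typical Linux values, not read from the running kernel):
    // 1 s initial retransmission timeout, doubled per retry, capped at 120 s.
    static final long INITIAL_RTO_SECONDS = 1;
    static final long MAX_RTO_SECONDS = 120;

    /** Seconds until connect() gives up when no SYN is ever answered. */
    static long connectTimeoutSeconds(int synRetries) {
        long total = 0;
        long rto = INITIAL_RTO_SECONDS;
        // One initial SYN plus synRetries retransmissions; each is followed by a wait.
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto;
            rto = Math.min(rto * 2, MAX_RTO_SECONDS);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("tcp_syn_retries=1 -> " + connectTimeoutSeconds(1) + " s");
        System.out.println("tcp_syn_retries=6 -> " + connectTimeoutSeconds(6) + " s");
    }
}
```

Under these assumptions, tcp_syn_retries=1 gives up after 1 + 2 = 3 seconds, while tcp_syn_retries=6 waits 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds.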
{quote}Does it help to update ha.health-monitor.rpc-timeout.ms? This is by 
default 45 seconds. We found that bumping it to 90 or even 180 helps to work 
around certain long running HDFS RPCs.
{quote}
Yes, we've updated ha.health-monitor.rpc-timeout.ms, and it helps. 
This Jira only proposes that, since the health-monitor already has a separate 
config key for rpc-timeout, its connection retry count should also be 
configurable rather than hard-coded to 1. If we don't want the health-monitor 
to be so sensitive, we can at least change its behavior through this 
configuration.
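
To make the proposal concrete, here is a minimal sketch of reading the retry count from configuration while keeping today's value of 1 as the default. It is not the attached patch: the key name is hypothetical, and java.util.Properties stands in for Hadoop's Configuration class so the example is self-contained.

```java
import java.util.Properties;

public class HealthMonitorRetryConfig {
    // Hypothetical key name, chosen for illustration only.
    static final String CONNECT_MAX_RETRIES_KEY =
        "ha.health-monitor.rpc.connect.max.retries";
    // Matches the currently hard-coded value, so behavior is unchanged by default.
    static final int CONNECT_MAX_RETRIES_DEFAULT = 1;

    /** Returns the configured retry count, or the default when the key is unset. */
    static int connectMaxRetries(Properties conf) {
        String raw = conf.getProperty(CONNECT_MAX_RETRIES_KEY);
        if (raw == null) {
            return CONNECT_MAX_RETRIES_DEFAULT;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                CONNECT_MAX_RETRIES_KEY + " must be >= 0, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default retries: " + connectMaxRetries(conf));
        conf.setProperty(CONNECT_MAX_RETRIES_KEY, "3");
        System.out.println("configured retries: " + connectMaxRetries(conf));
    }
}
```

Keeping the default at 1 means existing clusters see no change unless an operator sets the key explicitly.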

> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>
>                 Key: HDFS-14652
>                 URL: https://issues.apache.org/jira/browse/HDFS-14652
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch
>
>
> On our production HDFS cluster, bursts of requests from some clients fill 
> the TCP kernel queue on the NameNode's host. Because 
> "net.ipv4.tcp_syn_retries" is set to 1 in our environment, after 3 seconds 
> the ZooKeeper HealthMonitor got a connection error like this:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to 
> monitor health of NameNode at nn_host_name/ip_address:port: Call From 
> zkfc_host_name/ip to nn_host_name:port failed on connection exception: 
> java.net.ConnectException: Connection timed out; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error caused a failover and affected the availability of that cluster. 
> We fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries 
> to 6. While working on this issue, we found that the connection retry count 
> (ipc.client.connect.max.retries) of the health-monitor is hard-coded to 1. I 
> think it should be configurable, so that if we don't want the health-monitor 
> to be so sensitive, we can change its behavior by changing this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
