[
https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896730#comment-16896730
]
Chen Zhang commented on HDFS-14652:
-----------------------------------
Thanks [~jojochuang]. I also don't know why these machines were initialized with
net.ipv4.tcp_syn_retries=1. Our company has hundreds of production services, and
different services have different requirements, so it may just be a mistake
made by our DevOps team, but it's definitely not what we want. We've set this
config to 6 on all Hadoop machines.
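As a rough illustration of why this value matters, the sketch below estimates how long a connect attempt waits before timing out. It assumes an initial SYN retransmission timeout of 1 second that doubles after every attempt (typical for Linux, but not stated in this thread); the class and method names are made up for the example.

```java
public class SynTimeoutSketch {
    // Assumed initial retransmission timeout in seconds (not taken from this thread).
    static final long INITIAL_RTO_SECONDS = 1;

    /** Estimated seconds before a connect attempt fails when no SYN-ACK ever arrives. */
    static long totalSeconds(int synRetries) {
        long total = 0;
        long rto = INITIAL_RTO_SECONDS;
        // One initial SYN plus synRetries retransmissions; each waits one RTO, then the RTO doubles.
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto;
            rto *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("tcp_syn_retries=1 -> " + totalSeconds(1) + "s");
        System.out.println("tcp_syn_retries=6 -> " + totalSeconds(6) + "s");
    }
}
```

Under these assumptions, a value of 1 gives up after 1 + 2 = 3 seconds, which matches the 3 seconds observed in the issue description, while a value of 6 allows roughly 127 seconds.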
{quote}Does it help to update ha.health-monitor.rpc-timeout.ms? This is by
default 45 seconds. We found that bumping it to 90 or even 180 helps to work
around certain long running HDFS RPCs.
{quote}
Yes, we've updated the ha.health-monitor.rpc-timeout.ms config, and it helps.
This Jira is just a proposal: since the health-monitor already has a separate
config key for rpc-timeout, the retry count should also be configurable, not
hard-coded to 1. If we don't want the health-monitor to be so sensitive, we can
at least change its behavior through this configuration.
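A minimal sketch of the idea, not the attached patch: it uses java.util.Properties as a stand-in for Hadoop's Configuration so that it runs on its own, and the key name ha.health-monitor.rpc.connect.max.retries is a hypothetical choice for illustration. The default of 1 mirrors the currently hard-coded value.

```java
import java.util.Properties;

public class HealthMonitorRetrySketch {
    // Hypothetical key name, chosen only for this illustration.
    static final String HM_CONNECT_RETRIES_KEY = "ha.health-monitor.rpc.connect.max.retries";
    // Default mirrors the value that is hard-coded today.
    static final int HM_CONNECT_RETRIES_DEFAULT = 1;
    // Existing IPC client key named in the issue description.
    static final String IPC_CONNECT_RETRIES_KEY = "ipc.client.connect.max.retries";

    /** Reads the health-monitor retry count, falling back to the default when unset. */
    static int healthMonitorRetries(Properties conf) {
        String raw = conf.getProperty(HM_CONNECT_RETRIES_KEY);
        if (raw == null) {
            return HM_CONNECT_RETRIES_DEFAULT;
        }
        int retries = Integer.parseInt(raw.trim());
        if (retries < 0) {
            throw new IllegalArgumentException(HM_CONNECT_RETRIES_KEY + " must be >= 0");
        }
        return retries;
    }

    /** Returns a copy of the config in which the IPC retry count follows the health-monitor key. */
    static Properties healthMonitorConf(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(IPC_CONNECT_RETRIES_KEY, Integer.toString(healthMonitorRetries(base)));
        return copy;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(HM_CONNECT_RETRIES_KEY, "3");
        System.out.println(healthMonitorConf(conf).getProperty(IPC_CONNECT_RETRIES_KEY));
    }
}
```

Working on a copy keeps the override local to the health-monitor connection, so other RPC clients that share the base configuration are unaffected.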
> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>
> Key: HDFS-14652
> URL: https://issues.apache.org/jira/browse/HDFS-14652
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch
>
>
> On our production HDFS cluster, burst requests from some clients filled the TCP
> kernel queue on the NameNode's host. Because "net.ipv4.tcp_syn_retries" is set
> to 1 in our environment, after 3 seconds the ZooKeeper HealthMonitor got a
> connection error like this:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to
> monitor health of NameNode at nn_host_name/ip_address:port: Call From
> zkfc_host_name/ip to nn_host_name:port failed on connection exception:
> java.net.ConnectException: Connection timed out; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error caused a failover and affected the availability of that cluster. We
> fixed the issue by increasing the kernel parameter net.ipv4.tcp_syn_retries to 6.
> But while working on this issue, we found that the connection retry count
> (ipc.client.connect.max.retries) of the health-monitor is hard-coded to 1. I
> think it should be configurable, so that if we don't want the health-monitor to
> be so sensitive, we can change its behavior by changing this configuration.