[
https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901535#comment-16901535
]
Hudson commented on HDFS-14652:
-------------------------------
FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17051 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/17051/])
HDFS-14652. Addendum: HealthMonitor connection retry times should be (weichiu:
rev 8cef9f89f4218971199363f1809401c8305ede9b)
* (edit) hadoop-common-project/hadoop-common/src/main/resources/core-default.xml
> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>
> Key: HDFS-14652
> URL: https://issues.apache.org/jira/browse/HDFS-14652
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch,
> HDFS-14652.003.patch
>
>
> On our production HDFS cluster, a burst of requests from some clients filled
> the kernel TCP queue on the NameNode host. Because
> "net.ipv4.tcp_syn_retries" is set to 1 in our environment, the ZooKeeper
> HealthMonitor got a connection error after 3 seconds, like this:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to
> monitor health of NameNode at nn_host_name/ip_address:port: Call From
> zkfc_host_name/ip to nn_host_name:port failed on connection exception:
> java.net.ConnectException: Connection timed out; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error caused a failover and affected the availability of that cluster. We
> fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries to 6.
> While working on this issue, we found that the connection retry count
> (ipc.client.connect.max.retries) of the health monitor is hard-coded to 1. I
> think it should be configurable, so that if we don't want the health monitor
> to be so sensitive, we can change its behavior through this configuration.
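
The 3-second figure in the description above can be reproduced with a little arithmetic. The following is a minimal sketch, not code from the patch. It assumes the usual Linux behaviour of a 1-second initial SYN retransmission timeout that doubles after each attempt and is capped at 120 seconds. The class and method names are hypothetical and exist only for illustration.

```java
public class SynRetryTimeout {
    // Assumption: Linux starts with a 1 s SYN retransmission timeout.
    static final long INITIAL_RTO_SECONDS = 1;
    // Assumption: the per-attempt timeout is capped at 120 s.
    static final long MAX_RTO_SECONDS = 120;

    /**
     * Approximate time, in seconds, before a connect() to an unresponsive
     * host gives up, for a given net.ipv4.tcp_syn_retries value.
     * The first SYN plus each retry waits one timeout period, and the
     * period doubles after every attempt.
     */
    static long connectTimeoutSeconds(int synRetries) {
        long total = 0;
        long rto = INITIAL_RTO_SECONDS;
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto;
            rto = Math.min(rto * 2, MAX_RTO_SECONDS);
        }
        return total;
    }

    public static void main(String[] args) {
        // Compare the original setting (1) with the raised setting (6).
        for (int retries : new int[] {1, 6}) {
            System.out.println("tcp_syn_retries=" + retries + " -> about "
                    + connectTimeoutSeconds(retries) + " s before connect() fails");
        }
    }
}
```

Under these assumptions, tcp_syn_retries=1 gives about 3 seconds (1 + 2), which matches the report, and tcp_syn_retries=6 gives about 127 seconds. With only one connection attempt from the health monitor, a single 3-second stall was enough to trigger the failover.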