[ https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900589#comment-16900589 ]

Hadoop QA commented on HDFS-14652:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} | {color:red} HDFS-14652 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14652 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12976780/HDFS-14652.003.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/27412/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>
>                 Key: HDFS-14652
>                 URL: https://issues.apache.org/jira/browse/HDFS-14652
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch, HDFS-14652.003.patch
>
>
> On our production HDFS cluster, bursts of client requests filled the kernel 
> TCP queue on the NameNode host. Because "net.ipv4.tcp_syn_retries" is set to 
> 1 in our environment, the ZKFC HealthMonitor got a connection error after 3 
> seconds:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to 
> monitor health of NameNode at nn_host_name/ip_address:port: Call From 
> zkfc_host_name/ip to nn_host_name:port failed on connection exception: 
> java.net.ConnectException: Connection timed out; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error triggered a failover and affected the availability of that 
> cluster. We fixed the issue by raising the kernel parameter 
> net.ipv4.tcp_syn_retries to 6.
> While working on this issue, we found that the health monitor's connection 
> retry count (ipc.client.connect.max.retries) is hard-coded to 1. I think it 
> should be configurable, so that if we don't want the health monitor to be so 
> sensitive, we can change its behavior through this setting.
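
The proposal above amounts to reading the retry count from configuration and falling back to the current hard-coded value of 1. A minimal sketch of that lookup follows. It uses java.util.Properties as a stand-in for Hadoop's configuration object, and the key name and the class and method names are hypothetical, chosen only for illustration. They are not taken from the attached patches.

```java
import java.util.Properties;

public class HealthMonitorRetryConfig {
    // Hypothetical key name, for illustration only (not confirmed by the issue text).
    static final String MAX_RETRIES_KEY = "ha.health-monitor.rpc.connect.max.retries";
    // Default of 1 mirrors the hard-coded value described in the issue.
    static final int MAX_RETRIES_DEFAULT = 1;

    // Returns the configured retry count, or the default when the key is unset.
    static int maxRetries(Properties conf) {
        String raw = conf.getProperty(MAX_RETRIES_KEY);
        if (raw == null) {
            return MAX_RETRIES_DEFAULT;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                MAX_RETRIES_KEY + " must be non-negative, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(maxRetries(conf)); // unset: falls back to the default

        conf.setProperty(MAX_RETRIES_KEY, "3");
        System.out.println(maxRetries(conf)); // explicit override
    }
}
```

Keeping the default at 1 would preserve existing behavior for clusters that do not set the key, so only operators who want a less sensitive health monitor would need to change anything.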

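The 3-second figure in the description is consistent with exponential backoff on SYN retransmission. The sketch below works through the arithmetic, assuming an initial retransmission timeout of 1 second that doubles after each retry and ignoring any upper cap on the timeout. These are assumptions about the kernel's behavior, not statements from the issue, and the class and method names are hypothetical.

```java
public class SynTimeout {
    // Total seconds before connect() gives up: the initial SYN plus
    // synRetries retransmissions, each waiting twice as long as the last.
    static long connectTimeoutSeconds(int synRetries, long initialRtoSeconds) {
        long total = 0;
        long rto = initialRtoSeconds;
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += rto; // wait for a SYN-ACK after this attempt
            rto *= 2;     // back off before the next retransmission
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutSeconds(1, 1)); // 1 + 2 = 3
        System.out.println(connectTimeoutSeconds(6, 1)); // 1 + 2 + ... + 64 = 127
    }
}
```

Under these assumptions, raising net.ipv4.tcp_syn_retries from 1 to 6 stretches the connection attempt from about 3 seconds to about 127 seconds, which gives a briefly saturated NameNode far more time to accept the connection.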


--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
