[ https://issues.apache.org/jira/browse/ZOOKEEPER-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719736#comment-16719736 ]

Michael Han commented on ZOOKEEPER-3211:
----------------------------------------

{quote}Have similar defects been solved in 3.4.13?
{quote}
There have been earlier reports about CLOSE_WAIT, but if I remember correctly, most of those cases ended with no action taken because they were hard to reproduce.
{quote}It looks like zk Server is deadlocked
{quote}
The thread dump in the 1.log file shows that some threads are blocked, but that looks like a symptom rather than the cause. If the server runs out of available sockets, ZooKeeper threads that perform file IO or socket IO will block.
{quote}Does this cause CLOSE_WAIT for zk?
{quote}
Most of the time, long-lived CLOSE_WAIT connections indicate an application-side bug rather than a kernel bug: the connection should be closed, but for some reason the application does not close it after the client has closed its end (TCP FIN), which effectively leaks connections. The kernel upgrade could still be a trigger, though.

I am interested to know whether anyone else can reproduce this. I currently don't have an environment to reproduce it in.

Also, [~yss], can you please upload log files as a zip file instead of a rar file?

Another thing to try is to increase your limit on open file descriptors; it seems to be currently set to 60? If you increase it (ulimit), you may still leak connections, but the server should stay available longer before it runs out of sockets.

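To watch for the leak described above, a rough sketch like the following may help. It counts sockets in CLOSE_WAIT on a given local port by reading /proc/net/tcp (Linux only) and prints the open file descriptor limit. The function names are made up for illustration, and port 2181 is only an assumption, since the report does not state the clientPort in use.

```python
import resource

# Linux kernel TCP state code for CLOSE_WAIT as it appears in /proc/net/tcp.
CLOSE_WAIT = "08"


def count_close_wait(proc_net_tcp_text, local_port):
    """Count sockets in CLOSE_WAIT whose local port equals local_port."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local "HEXIP:HEXPORT", fields[3] is the state code.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port and fields[3] == CLOSE_WAIT:
            count += 1
    return count


def open_file_limits():
    """Return the (soft, hard) RLIMIT_NOFILE limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    port = 2181  # assumed client port; the report does not state clientPort
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                total += count_close_wait(f.read(), port)
        except OSError:
            pass  # file absent (e.g. IPv6 disabled or a non-Linux host)
    print("CLOSE_WAIT sockets on port", port, "=", total)
    print("open file limits (soft, hard) =", open_file_limits())
```

Note that open_file_limits() reports the limit of the process running the script, not of the ZooKeeper server. For the server itself, the limit has to be checked in the environment that starts it (for example with ulimit -n in that shell). A CLOSE_WAIT count that keeps growing toward that limit would be consistent with the connection leak described above.
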
> zookeeper standalone mode,found a high level bug in kernel of centos7.0 ,zookeeper Server's tcp/ip socket connections(default 60 ) are CLOSE_WAIT ,this lead to zk can't work for client any more
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3211
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3211
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.5
>         Environment: 1.zoo.cfg
> server.1=127.0.0.1:2902:2903
> 2.kernel
> kernel:Linux localhost.localdomain 3.10.0-123.el7.x86_64 #1 SMP Tue Feb 12 19:44:50 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> JDK:
> java version "1.7.0_181"
> OpenJDK Runtime Environment (rhel-2.6.14.5.el7-x86_64 u181-b00)
> OpenJDK 64-Bit Server VM (build 24.181-b00, mixed mode)
> zk: 3.4.5
>            Reporter: yeshuangshuang
>            Priority: Blocker
>             Fix For: 3.4.5
>
>         Attachments: 1.log, zklog.rar
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> 1.config--zoo.cfg
> server.1=127.0.0.1:2902:2903
> 2.kernel version
> version:Linux localhost.localdomain 3.10.0-123.el7.x86_64 #1 SMP Tue Feb 12 19:44:50 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> JDK:
> java version "1.7.0_181"
> OpenJDK Runtime Environment (rhel-2.6.14.5.el7-x86_64 u181-b00)
> OpenJDK 64-Bit Server VM (build 24.181-b00, mixed mode)
> zk: 3.4.5
> 3.bug details:
> Occasionally,But the recurrence probability is extremely high. At first, the read-write timeout takes about 6s, and after a few minutes, all connections (including long ones) will be CLOSE_WAIT state.
> 4.:Circumvention scheme: it is found that all connections become close_wait to restart the zookeeper server side actively

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)