[ https://issues.apache.org/jira/browse/KAFKA-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320671#comment-17320671 ]
GeoffreyStark commented on KAFKA-8714:
--------------------------------------

maybe the same issue I created: https://issues.apache.org/jira/browse/KAFKA-12665

> CLOSE_WAIT connections piling up on the broker
> ----------------------------------------------
>
>                 Key: KAFKA-8714
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8714
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.1.0, 2.3.0
>         Environment: Linux 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Rajdeep Mukherjee
>            Priority: Major
>         Attachments: Screenshot from 2019-07-25 11-53-24.png, consumer_multiprocessing.py, producer_multiprocessing.py
>
> We are experiencing an issue where CLOSE_WAIT connections pile up on the brokers, eventually triggering a "Too many open files" error and a crash of the affected broker. After some digging, we realized that this happens when many clients (producers or consumers) close their connections within a brief interval of time, i.e. when the frequency of client connection closes spikes.
> The actual error that we encountered was:
> {code:java}
> [2019-07-18 00:03:27,861] ERROR Error while accepting connection (kafka.network.Acceptor) java.io.IOException: Too many open files
>     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>     at kafka.network.Acceptor.accept(SocketServer.scala:326)
>     at kafka.network.Acceptor.run(SocketServer.scala:269)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> When the error was encountered, the broker had roughly 200,000 connections in CLOSE_WAIT and approximately 15,000 in ESTABLISHED.
> The attachment shows the issue; the sharp dip in the graph is the point where the broker restarted.
> We encountered this problem in both Kafka 0.10.1 and 2.3.0.
> The client versions we used for reproducing were:
>
> {code:java}
> confluent-kafka==1.1.0
> librdkafka v1.1.0
> {code}
>
> Steps to reproduce:
> I have attached the scripts we used for reproducing the issue.
> In our QA environment we were able to reproduce the issue in the following way:
> * We spun up a 5-node Kafka v2.3.0 cluster.
> * We prepared a Python script that spins up on the order of 500+ producer processes and the same number of consumer processes, with logic to randomly close the producer and consumer connections at a high frequency, on the order of 10 closes per second, for 5 minutes.
> * On the broker side, we watched for CLOSE_WAIT connections using `lsof` and saw sustained CLOSE_WAIT connections that persisted until we restarted Kafka on the corresponding broker.
> The commands to run for the producer and consumer scripts are:
> {code:java}
> python producer_multiprocessing.py <time in seconds> <number of processes> <sleep in seconds between produce> true true
> python consumer_multiprocessing.py <time in seconds> <number of processes> 0 true
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
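For readers unfamiliar with the TCP state involved: a socket enters CLOSE_WAIT when the peer has sent its FIN but the local application has not yet called close(). The fd stays open (and counts toward the process's open-file limit) until the application closes it, which is exactly the leak pattern the report describes. A minimal stdlib sketch, not Kafka code:

```python
import socket

def demo_close_wait():
    """Illustration only (not Kafka code): an accepted socket whose peer
    has closed stays open -- and holds a file descriptor -- until the
    application calls close() on it. A server that is slow to do this
    accumulates CLOSE_WAIT sockets and eventually hits the fd limit."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))   # ephemeral port
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()
    cli.close()            # peer sends FIN -> server side enters CLOSE_WAIT
    eof = conn.recv(1024)  # b'': EOF is visible, but our fd is still open
    fd_open = conn.fileno() >= 0
    conn.close()           # only now does the kernel release the socket
    srv.close()
    return eof, fd_open

print(demo_close_wait())  # -> (b'', True)
```

The broker-side fix, whatever form it takes, amounts to ensuring the accepted channel is closed promptly once the client's FIN is observed.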
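The reporter watched for CLOSE_WAIT connections with `lsof`; on Linux the same count can be taken from `/proc/net/tcp`, where state `08` is CLOSE_WAIT. A hedged monitoring sketch (function name is illustrative; this reads kernel state directly rather than shelling out to `lsof`):

```python
def count_close_wait():
    """Count sockets in CLOSE_WAIT on a Linux host by scanning
    /proc/net/tcp and /proc/net/tcp6. Column 4 (sl local rem st ...)
    holds the TCP state; 08 is CLOSE_WAIT."""
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip header row
                total += sum(1 for line in f if line.split()[3] == "08")
        except OSError:
            pass  # e.g. no IPv6 support
    return total

print(count_close_wait())
```

Polling this (or `lsof -p <broker pid> -i TCP | grep -c CLOSE_WAIT`, as in the report) alongside broker restarts makes the leak and its reset clearly visible, as in the attached screenshot.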