Rajdeep Mukherjee created KAFKA-8714:
----------------------------------------
Summary: CLOSE_WAIT connections piling up on the broker
Key: KAFKA-8714
URL: https://issues.apache.org/jira/browse/KAFKA-8714
Project: Kafka
Issue Type: Bug
Affects Versions: 2.3.0, 0.10.1.0
Environment: Linux 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Rajdeep Mukherjee
Attachments: Screenshot from 2019-07-25 11-53-24.png,
consumer_multiprocessing.py, producer_multiprocessing.py
We are experiencing an issue where `CLOSE_WAIT` connections pile up on the
brokers, leading to a `Too many open files` error and, eventually, a crash of
the corresponding broker. After some digging, we realized that this happens
when multiple clients (producers or consumers) close their connections within
a brief interval of time, i.e. when the rate of client connection closes
spikes.
The actual error we encountered was:
{code:java}
[2019-07-18 00:03:27,861] ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:326)
at kafka.network.Acceptor.run(SocketServer.scala:269)
at java.lang.Thread.run(Thread.java:745)
{code}
When the error was encountered, the number of CLOSE_WAIT connections on the
broker was 200,000 and the number of ESTABLISHED connections was approximately
15,000.
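For reference, here is a minimal sketch of how per-state connection counts can be gathered on a broker host. This is an assumption-laden illustration: it uses `ss` from iproute2 and assumes the broker listens on port 9092, whereas our actual monitoring used `lsof`.
{code:python}
# Minimal sketch: count TCP sockets per state for the Kafka listener port.
# Assumptions: broker listens on port 9092 and the iproute2 `ss` tool is installed.
import subprocess
from collections import Counter

def connection_states(port=9092):
    out = subprocess.run(["ss", "-tan"], capture_output=True, text=True, check=True).stdout
    states = Counter()
    for line in out.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        state, local_addr = fields[0], fields[3]
        if local_addr.endswith(f":{port}"):  # only connections to the broker listener
            states[state] += 1
    return states

if __name__ == "__main__":
    for state, count in sorted(connection_states().items()):
        print(f"{state}: {count}")
{code}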
The attached screenshot shows the issue; the sharp dip in the graph is the
point where the broker was restarted.
We encountered this problem on both Kafka 0.10.1 and 2.3.0.
The client versions we used to reproduce the issue were:
{code:java}
confluent-kafka==1.1.0
librdkafka v1.1.0
{code}
Steps to reproduce:
I have attached the scripts we used to reproduce the issue.
In our QA environment we were able to reproduce it in the following way:
* we spun up a 5-node Kafka v2.3.0 cluster
* we prepared a Python script that spins up on the order of 500+ producer
processes and the same number of consumer processes, with logic to randomly
close the producer and consumer connections at a high frequency (on the order
of 10 closes per second) for 5 minutes; see the sketch after this list
* on the broker side, we watched for CLOSE_WAIT connections using `lsof` and
observed sustained CLOSE_WAIT connections that persisted until we restarted
Kafka on the corresponding broker
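For illustration, here is a minimal sketch of the producer side of that approach. The attached scripts are the ones we actually ran; the bootstrap server, topic name, and timings below are placeholders, and dropping the `Producer` object is used to approximate closing the connection.
{code:python}
# Minimal sketch of the reproduction approach, not the attached script itself.
# Assumptions: confluent-kafka 1.1.0, hypothetical bootstrap server and topic name.
import multiprocessing
import random
import time
from confluent_kafka import Producer

BOOTSTRAP = "broker-1:9092"   # hypothetical broker address
TOPIC = "close-wait-test"     # hypothetical topic

def producer_loop(duration_s):
    end = time.time() + duration_s
    while time.time() < end:
        p = Producer({"bootstrap.servers": BOOTSTRAP})
        p.produce(TOPIC, b"payload")
        p.flush(5)
        # Drop the connection shortly after producing; many processes doing this
        # concurrently yields a high rate of connection closes on the brokers.
        time.sleep(random.uniform(0.05, 0.2))
        del p

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=producer_loop, args=(300,))
             for _ in range(500)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
{code}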
The commands to run the producer and consumer scripts are:
{code:java}
python producer_multiprocessing.py <time in seconds> <number of processes> <sleep in seconds between produce> true true
python consumer_multiprocessing.py <time in seconds> <number of processes> 0 true
{code}
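For example, a run matching the setup above would be `python producer_multiprocessing.py 300 500 0.1 true true` and `python consumer_multiprocessing.py 300 500 0 true` (300 seconds and 500 processes per the steps above; the 0.1 s sleep is illustrative, and the trailing flags are taken verbatim from the command templates).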