Noam Berman created KAFKA-6565:
----------------------------------
Summary: Sudden unresponsiveness from broker + file descriptor leak
Key: KAFKA-6565
URL: https://issues.apache.org/jira/browse/KAFKA-6565
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.10.1.0
Environment: 3 separate clusters running 0.10.1.0 with 5 nodes + 5
zookeepers.
clients are mostly java 0.10.2.1, and a few nodejs clients
(https://github.com/oleksiyk/kafka).
throughput per broker: ~3.5MB per second. each broker is a leader of ~2500
partitions.
Reporter: Noam Berman
Running 3 kafka clusters v0.10.1.0 on aws, 5 brokers each (+5 zookeepers), in
the past week about 2-3 times a day (on all clusters, no specific brokers) we
encounter a situation in which a single broker stops responding to all requests
from clients/other brokers. at this moment all produced messages fail, and
other brokers start writing this to log repeatedly:
[2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch
kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 878 was disconnected before the response was
read
at
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at
kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at
kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at
kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at
kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
Once this happens, file descriptors on the affected broker start increasing at
a high rate until it reaches its maximum.
no relevant info logs on the affected broker, we'll try to gather debug logs
and attach them to the bug next time it happens.
the big issue here is that although the broker is unresponsive, it stays in the
cluster and its zookeeper node isn't cleared, so there's no rebalance
whatsoever - all clients start failing until a manual shutdown occurs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)