Candice Wan created KAFKA-7802:
----------------------------------

             Summary: Connection to Broker Disconnected Taking Down the Whole Cluster
                 Key: KAFKA-7802
                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.1.0
            Reporter: Candice Wan
         Attachments: thread_dump.log

We recently upgraded to 2.1.0. Since then, several times per day, we have observed brokers becoming disconnected while other brokers were trying to fetch replicas from them. This issue took down the whole cluster, leaving all producers and consumers unable to publish or consume messages.
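
For reference, here is a minimal sketch of how the resulting under-replication can be spotted from a client with the standard AdminClient API. This is only an illustration, not part of our setup; the bootstrap address and class name are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address, not taken from our environment.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Describe every topic and report partitions whose ISR has shrunk
            // below the full replica set.
            for (TopicDescription topic : admin
                    .describeTopics(admin.listTopics().names().get())
                    .all().get().values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("Under-replicated: %s-%d isr=%d replicas=%d%n",
                                topic.name(), p.partition(),
                                p.isr().size(), p.replicas().size());
                    }
                }
            }
        }
    }
}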

Here is an example of what we're seeing in the broker that was trying to send fetch requests to the problematic one:

2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[178])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, epoch=1599941))
java.io.IOException: Connection to 3 was disconnected before the response was read
 at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
 at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
 at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
 at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
 at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
 at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
 at scala.Option.foreach(Option.scala:257)
 at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
 at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

 

We also took a thread dump of the problematic broker (attached). We found that all the kafka-request-handler threads were hanging, waiting on locks, which suggests a resource leak there.
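
For illustration only (this is not how the attached dump was produced), the same information can be read over the broker's JMX endpoint using the standard ThreadMXBean. The JMX host and port below are placeholders and assume remote JMX is enabled on the broker.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestHandlerLockCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX host/port; point this at the broker's JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // Dump all threads and report request handlers stuck waiting on a lock.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                if (info.getThreadName().startsWith("kafka-request-handler")
                        && info.getLockName() != null) {
                    System.out.printf("%s is %s on %s (held by %s)%n",
                            info.getThreadName(), info.getThreadState(),
                            info.getLockName(), info.getLockOwnerName());
                }
            }
        }
    }
}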

 

FYI, the Java version we are running is 11.0.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
