[ https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Candice Wan updated KAFKA-7802:
-------------------------------
Description:

We recently upgraded to 2.1.0. Since then, several times per day, we observe some brokers being disconnected when other brokers try to fetch replicas from them. This issue took down the whole cluster, leaving all producers and consumers unable to publish or consume messages.

Here is an example of what we're seeing in a broker that was trying to send a fetch request to the problematic one:

2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[178])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, epoch=1599941))
java.io.IOException: Connection to 3 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
        at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
        at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
        at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

We also took a thread dump of the problematic broker (attached). We found that all the kafka-request-handler threads were hanging, waiting on locks, which looks like a resource leak there.

The Java version we are running is 11.0.1.
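The hang described above (every kafka-request-handler thread parked on a lock) can also be checked without a full jstack by asking the broker's JVM for a thread dump over JMX. The sketch below is only illustrative: the host name is hypothetical and it assumes the broker exposes remote JMX (e.g. started with JMX_PORT=9999). It lists the request-handler threads, the lock each one is waiting on, and the owning thread, which is the same information read out of the attached thread_dump.log.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestHandlerLockCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint; assumes the broker was started with JMX_PORT=9999.
        String url = "service:jmx:rmi:///jndi/rmi://broker3.example.com:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // Report every request-handler thread, the lock it waits on, and who holds it.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                if (info == null || !info.getThreadName().startsWith("kafka-request-handler")) {
                    continue;
                }
                System.out.printf("%s state=%s waitingOn=%s heldBy=%s%n",
                        info.getThreadName(), info.getThreadState(),
                        info.getLockName(), info.getLockOwnerName());
            }

            // A true JVM-level deadlock would show up here; a lock that is acquired and
            // simply never released will not, but the dump above still shows the handlers parked.
            long[] deadlocked = threads.findDeadlockedThreads();
            System.out.println("deadlocked thread ids: "
                    + (deadlocked == null ? "none" : deadlocked.length));
        }
    }
}
{code}

Running this against each broker when the fetch errors start would show whether the request handlers on the affected node are all blocked behind a single lock owner.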
> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7802
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: thread_dump.log
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)