[ https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741843#comment-16741843 ]

Candice Wan commented on KAFKA-7802:
------------------------------------

When is 2.1.1 expected to be released?

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7802
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7802
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.0. Since then, several times per day, we have 
> observed brokers being disconnected when other brokers tried to fetch 
> replicas from them. This issue took down the whole cluster, leaving all 
> producers and consumers unable to publish or consume messages. It could be 
> quickly fixed by restarting the problematic broker.
> Here is an example of what we see in a broker that was trying to send a 
> fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[178])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, epoch=1599941))
>  java.io.IOException: Connection to 3 was disconnected before the response was read
>  at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>  at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>  at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>  at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>  at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>  at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>  at scala.Option.foreach(Option.scala:257)
>  at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
>  Below are the suspicious log entries from the problematic broker when the 
> issue happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-instance1 in state PreparingRebalance with old generation 270 (__consumer_offsets-47) (reason: removing member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group Notifications.ASIA1546980352799 has failed, removing it from the group
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance with old generation 1 (__consumer_offsets-44) (reason: removing member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group Notifications.ASIA1546980352799 with generation 2 is now empty (__consumer_offsets-44)
>  2019-01-09 08:04:50.388 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-3-0a4c55c2-9f31-4e7a-b0d7-1f057dceb03d in group talon-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 in group talon-kafka-vision has failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-kafka-vision in state PreparingRebalance with old generation 9 (__consumer_offsets-26) (reason: removing member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 on heartbeat expiration)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-kafka-vision with generation 10 is now empty (__consumer_offsets-26)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 in group talon-hades-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-hades-instance1 in state PreparingRebalance with old generation 122 (__consumer_offsets-11) (reason: removing member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 on heartbeat expiration)
>  2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-hades-instance1 with generation 123 is now empty (__consumer_offsets-11)
>  2019-01-09 08:04:50.422 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-a527e579-7a14-471b-b19d-ffec50074bb8 in group talon-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.434 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-0c470e05-5e9a-4cae-a493-9854a6d0c8a7 in group talon-instance1 has failed, removing it from the group
>  2019-01-09 08:04:50.514 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-155ea6c8-c90f-4af6-b65e-138a151d77d9 in group talon-instance1 has failed, removing it from the group
>  2019-01-09 08:04:55.297 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group Notifications.ASIA1546980352799: The group is rebalancing, so a rejoin is needed.
>  2019-01-09 08:04:55.419 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-kafka-vision: The group is rebalancing, so a rejoin is needed.
>  2019-01-09 08:04:55.420 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-hades-instance1: The group is rebalancing, so a rejoin is needed.
>  
> We also took a thread dump of the problematic broker (attached). We found 
> that all the kafka-request-handler threads were hanging, waiting on locks, 
> which looks like a lock or resource leak there (see the sketch below the 
> quoted report).
>  
> The Java version we are running is 11.0.1.
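
A minimal sketch, for illustration only, of how the blocked
kafka-request-handler threads described above could be confirmed from inside
the broker JVM with java.lang.management. The class name is hypothetical, and
ThreadMXBean only sees the JVM it runs in, so it would have to be embedded or
attached rather than run standalone:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class RequestHandlerLockCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();

            // List request-handler threads currently blocked or waiting on a lock.
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                if (info.getThreadName().contains("kafka-request-handler")
                        && info.getLockName() != null) {
                    System.out.printf("%s is %s on %s held by %s%n",
                            info.getThreadName(), info.getThreadState(),
                            info.getLockName(), info.getLockOwnerName());
                }
            }

            // Non-null only for a genuine cycle on monitors or ownable
            // synchronizers, i.e. a real deadlock rather than a slow lock holder.
            long[] deadlocked = mx.findDeadlockedThreads();
            System.out.println(deadlocked == null
                    ? "No JVM-level deadlock detected"
                    : deadlocked.length + " threads in a deadlock cycle");
        }
    }

The same information can be obtained externally with the JDK's jstack tool run
against the broker process, which is presumably how the attached
thread_dump.log was captured.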



