[ https://issues.apache.org/jira/browse/KAFKA-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136069#comment-17136069 ]
Guozhang Wang commented on KAFKA-10134:
---------------------------------------
This is a bit weird to me -- the coordinator discovery logic did not change
from 2.4 to 2.5, AFAIK.
[~seanguo] could you list the consumer configs you used with cooperative
rebalance vs. eager rebalance?
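
For reference, the two setups would normally differ only in
partition.assignment.strategy. A minimal sketch of the comparison being asked
for, assuming plain java.util.Properties; the class name RebalanceConfigs, the
helper methods, the broker address and the group id are all hypothetical and
are not taken from the reporter's application:

```java
import java.util.Properties;

// Hypothetical helper for illustration only: builds the two consumer
// configurations to compare. Only the config keys and assignor class
// names are real Kafka identifiers.
class RebalanceConfigs {

    // Settings shared by both variants (placeholder values).
    static Properties base(String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed address
        props.setProperty("group.id", groupId);
        return props;
    }

    // Cooperative (incremental) rebalancing via CooperativeStickyAssignor.
    static Properties cooperative(String groupId) {
        Properties props = base(groupId);
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    // Eager rebalancing, here with RangeAssignor (the default assignor in 2.5).
    static Properties eager(String groupId) {
        Properties props = base(groupId);
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor");
        return props;
    }

    public static void main(String[] args) {
        // Print both variants so they can be pasted into the ticket side by side.
        System.out.println("cooperative: " + cooperative("example-group"));
        System.out.println("eager:       " + eager("example-group"));
    }
}
```

Any other setting that differs between the two runs (for example
max.poll.interval.ms or session.timeout.ms) would be worth listing as well.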
> High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
> ------------------------------------------------------------------------
>
> Key: KAFKA-10134
> URL: https://issues.apache.org/jira/browse/KAFKA-10134
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 2.5.0
> Reporter: Sean Guo
> Priority: Major
>
> We want to use the new rebalance protocol to mitigate the stop-the-world
> effect during rebalances, as our tasks are long-running.
> But after the upgrade, when we kill an instance to trigger a rebalance
> while there is some load (some tasks are long-running, >30s), the CPU
> goes sky-high. It reads ~700% in our metrics, so several threads must be
> in a tight loop. We have several consumer threads consuming from
> different partitions during the rebalance. This is reproducible with both
> the new CooperativeStickyAssignor and the old eager rebalance protocol. The
> difference is that with the old eager rebalance protocol, the high CPU
> usage drops after the rebalance is done. But with the cooperative one, the
> consumer threads seem to be stuck on something and cannot finish the
> rebalance, so the high CPU usage does not drop until we stop our load.
> A small load without long-running tasks also does not cause continuous
> high CPU usage, as the rebalance can finish in that case.
>
> "executor.kafka-consumer-executor-4" #124 daemon prio=5 os_prio=0 cpu=76853.07ms elapsed=841.16s tid=0x00007fe11f044000 nid=0x1f4 runnable [0x00007fe119aab000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:467)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1275)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1241)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1216)
>     at
>
> By debugging into the code, we found that the clients appear to be stuck in
> a loop trying to find the coordinator.
> I also tried the old rebalance protocol on the new version; the issue still
> exists, but the CPU goes back to normal when the rebalance is done.
> I also tried the same on 2.4.1, which does not seem to have this issue. So
> it seems related to something that changed between 2.4.1 and 2.5.0.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)