[ https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dong Lin updated KAFKA-5116:
----------------------------
    Fix Version/s:     (was: 2.1.0)
                   2.2.0

> Controller updates to ISR holds the controller lock for a very long time
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-5116
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5116
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.10.1.0, 0.10.2.0
>            Reporter: Justin Downing
>            Priority: Major
>             Fix For: 2.2.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR
> using JMX. Many of these requests appear to be 'hung' for a very long time
> (eg: >2m). We've dug a bunch, and found that in our case, sometimes the
> controllerLock can be held for multiple minutes in the IsrChangeNotifier
> callback.
> Inside the lock, we are reading from Zookeeper for *each* partition in the
> changeset. With a large changeset (eg: >500 partitions), [this
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
> can take a long time to complete.
> In KAFKA-2406, throttling was introduced to prevent overwhelming the
> controller with many changesets at once. However, this does not take into
> consideration _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. This would then only lock
> for the controller update and processing of the changeset.
> * Send limited changesets to Zookeeper when calling
> maybePropagateIsrChanges. When dealing with lots of partitions (eg: >1000) it
> may be useful to batch the changesets in groups of 100 rather than send the
> [entire
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
> to Zookeeper at once.
> We're happy working on patches for either or both of these, but we are unsure
> of the safety around these two proposals. Specifically, moving the Zookeeper
> request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means
> that broker failure won't be detected and acted upon quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
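
For illustration only, the two remediations proposed in the issue could be sketched roughly as follows. This is not Kafka's code: the class IsrChangeSketch, its methods, the in-memory isrCache, and the readIsr callback (standing in for the per-partition Zookeeper read) are all hypothetical names invented for this sketch. It also does not settle the safety question raised in the issue, since a value read outside the lock may be stale by the time it is applied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in Kafka.
final class IsrChangeSketch {
    // Stand-in for the controller lock described in the issue.
    private final ReentrantLock controllerLock = new ReentrantLock();
    // Stand-in for the controller's in-memory view: partition -> ISR broker ids.
    private final Map<String, List<Integer>> isrCache = new HashMap<>();

    // Remediation 2: split a large changeset into batches of at most batchSize.
    static <T> List<List<T>> batches(List<T> changeset, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < changeset.size(); i += batchSize) {
            int end = Math.min(i + batchSize, changeset.size());
            out.add(new ArrayList<>(changeset.subList(i, end)));
        }
        return out;
    }

    // Remediation 1: do the slow per-partition reads before taking the lock,
    // then hold the lock only while updating the in-memory state.
    void applyIsrChanges(List<String> partitions,
                         Function<String, List<Integer>> readIsr) {
        Map<String, List<Integer>> fetched = new HashMap<>();
        for (String partition : partitions) {
            // Slow call (assumed to be the Zookeeper read), made without the lock.
            fetched.put(partition, readIsr.apply(partition));
        }
        controllerLock.lock();
        try {
            // Caveat: fetched values may already be stale at this point.
            isrCache.putAll(fetched);
        } finally {
            controllerLock.unlock();
        }
    }

    // Read the cached ISR for one partition under the lock.
    List<Integer> cachedIsr(String partition) {
        controllerLock.lock();
        try {
            return isrCache.get(partition);
        } finally {
            controllerLock.unlock();
        }
    }
}
```

With a batch size of 100, a changeset of 1000 partitions would be propagated as 10 smaller notifications instead of one, and the lock in applyIsrChanges is held only for the in-memory update rather than for every remote read.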