[ https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981382#comment-15981382 ]
Ismael Juma commented on KAFKA-5116:
------------------------------------
cc [~onurkaraman] [~junrao]
> Controller updates to ISR holds the controller lock for a very long time
> ------------------------------------------------------------------------
>
> Key: KAFKA-5116
> URL: https://issues.apache.org/jira/browse/KAFKA-5116
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.10.1.0, 0.10.2.0
> Reporter: Justin Downing
> Fix For: 0.11.0.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR
> using JMX. Many of these requests appear to be 'hung' for a very long time
> (e.g., >2m). We dug into this and found that, in our case, the
> controllerLock is sometimes held for multiple minutes in the IsrChangeNotifier
> callback.
> Inside the lock, we are reading from Zookeeper for *each* partition in the
> changeset. With a large changeset (e.g., >500 partitions), [this
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
> can take a long time to complete.
> In KAFKA-2406, throttling was introduced to prevent overwhelming the
> controller with many changesets at once. However, this does not take into
> consideration _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. This would then only lock
> for the controller update and processing of the changeset.
> * Send limited changesets to Zookeeper when calling
> maybePropagateIsrChanges. When dealing with lots of partitions (e.g., >1000) it
> may be useful to batch the changesets in groups of 100 rather than send the
> [entire
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
> to Zookeeper at once.
> We're happy working on patches for either or both of these, but we are unsure
> of the safety around these two proposals. Specifically, moving the Zookeeper
> request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means
> that broker failure won't be detected and acted upon quickly.
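
As a rough illustration of the two remediations proposed in the description above, the following Python sketch combines them: it splits a changeset into batches of 100 and performs the slow per-partition reads outside the lock, taking the lock only for the in-memory update. This is not Kafka code. `IsrCache`, `read_isr`, `apply_changeset`, and `batched` are hypothetical names, and `read_isr` merely stands in for a ZooKeeper read.

```python
import threading
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class IsrCache:
    """Hypothetical stand-in for controller state guarded by a single lock."""

    def __init__(self, read_isr: Callable[[str], List[int]]) -> None:
        # read_isr is an assumed slow remote read (e.g. one ZooKeeper
        # lookup per partition); it is not a real Kafka API.
        self._read_isr = read_isr
        self._lock = threading.Lock()  # plays the role of controllerLock
        self._isr: Dict[str, List[int]] = {}

    def apply_changeset(self, partitions: List[str], batch_size: int = 100) -> None:
        for batch in batched(partitions, batch_size):
            # Slow reads happen outside the lock, so other threads are
            # not blocked while we wait on the remote store.
            fetched = {p: self._read_isr(p) for p in batch}
            # The lock is held only for the short in-memory update.
            with self._lock:
                self._isr.update(fetched)

    def isr(self, partition: str) -> List[int]:
        with self._lock:
            return list(self._isr.get(partition, []))
```

The sketch does not address the safety concern raised in the description: a value read outside the lock may be stale by the time it is applied, so a real change would need to decide whether that window is acceptable or whether the data must be re-validated under the lock.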