[ 
https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635780#comment-16635780
 ] 

Dong Lin commented on KAFKA-5116:
---------------------------------

Moving this to 2.2.0 since PR is not ready yet.

> Controller updates to ISR holds the controller lock for a very long time
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-5116
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5116
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.10.1.0, 0.10.2.0
>            Reporter: Justin Downing
>            Priority: Major
>             Fix For: 2.2.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR 
> using JMX. Many of these requests appear to be 'hung' for a very long time 
> (eg: >2m). We've dug into this and found that, in our case, the 
> controllerLock can sometimes be held for multiple minutes in the 
> IsrChangeNotifier callback.
> Inside the lock, we are reading from Zookeeper for *each* partition in the 
> changeset. With a large changeset (eg: >500 partitions), [this 
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
>  can take a long time to complete. 
> In KAFKA-2406, throttling was introduced to prevent overwhelming the 
> controller with many changesets at once. However, this does not take into 
> consideration _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. The lock would then be 
> held only for the controller update and processing of the changeset.
> * Send limited changesets to Zookeeper when calling 
> maybePropagateIsrChanges. When dealing with many partitions (eg: >1000), it 
> may be useful to batch the changesets in groups of 100 rather than sending 
> the [entire 
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
>  to Zookeeper at once.
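To illustrate the second proposal, here is a minimal sketch of grouping a changeset into fixed-size batches. This is not Kafka's actual code: `IsrBatchSketch`, `BATCH_SIZE`, and the plain `String` partition names are hypothetical stand-ins for the real ReplicaManager types.

```java
import java.util.ArrayList;
import java.util.List;

public class IsrBatchSketch {
    static final int BATCH_SIZE = 100; // proposed cap per ZooKeeper write

    // Split the full ISR changeset into groups of at most BATCH_SIZE,
    // so each ZooKeeper write carries a bounded amount of data.
    static List<List<String>> batch(List<String> isrChanges) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < isrChanges.size(); i += BATCH_SIZE) {
            int end = Math.min(i + BATCH_SIZE, isrChanges.size());
            batches.add(isrChanges.subList(i, end));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> changes = new ArrayList<>();
        for (int i = 0; i < 1050; i++) {
            changes.add("topic-" + i);
        }
        List<List<String>> batches = batch(changes);
        System.out.println(batches.size());         // 11 batches
        System.out.println(batches.get(10).size()); // final batch holds the 50 leftovers
    }
}
```

Each batch would then be written to Zookeeper as its own change notification, keeping any single notification (and the controller work it triggers) bounded in size.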
> We're happy to work on patches for either or both of these, but we are 
> unsure about the safety of these two proposals. Specifically, moving the 
> Zookeeper request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means 
> that broker failure won't be detected and acted upon quickly.
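For reference, the read-outside-the-lock shape proposed above looks roughly like the sketch below. `fetchIsrFromZk`, `isrState`, and the `ReentrantLock` are hypothetical stand-ins, not the controller's real members; the point is only the lock-scope change.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class ReadOutsideLockSketch {
    private final ReentrantLock controllerLock = new ReentrantLock();
    private final Map<String, List<Integer>> isrState = new HashMap<>();

    // Hypothetical stand-in for the per-partition ZooKeeper read that
    // currently happens while the controller lock is held.
    private List<Integer> fetchIsrFromZk(String partition) {
        return Arrays.asList(1, 2, 3); // dummy ISR; a real read would block on I/O
    }

    // Proposed shape: do all slow reads first, then take the lock only
    // for the in-memory controller update.
    public void processChangeset(List<String> partitions) {
        Map<String, List<Integer>> fetched = new HashMap<>();
        for (String p : partitions) {
            fetched.put(p, fetchIsrFromZk(p)); // no lock held during I/O
        }
        controllerLock.lock();
        try {
            isrState.putAll(fetched); // lock scope: in-memory update only
        } finally {
            controllerLock.unlock();
        }
    }

    public int knownPartitions() {
        return isrState.size();
    }
}
```

Note this sketch does not resolve the safety question raised above: the state read without the lock may already be stale by the time the lock is acquired and the update is applied.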



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
