[ https://issues.apache.org/jira/browse/KAFKA-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200088#comment-16200088 ]
ASF GitHub Bot commented on KAFKA-6051:
---------------------------------------

GitHub user mayt opened a pull request:

    https://github.com/apache/kafka/pull/4056

    KAFKA-6051 Close the ReplicaFetcherBlockingSend earlier on shutdown

    Rearranged the testAddPartitionDuringDeleteTopic() test to preserve the likelihood of the race condition.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mayt/kafka KAFKA-6051

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/4056.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4056

----
commit 36c1fa6ca3bab4dc070910cba9223f4141982d82
Author: Maytee Chinavanichkit <maytee.chinavanich...@linecorp.com>
Date: 2017-10-11T10:35:54Z

    KAFKA-6051 Close the ReplicaFetcherBlockingSend earlier on shutdown

    Rearranged the testAddPartitionDuringDeleteTopic() test to preserve the likelihood of the race condition.
----

> ReplicaFetcherThread should close the ReplicaFetcherBlockingSend earlier on shutdown
> -------------------------------------------------------------------------------------
>
> Key: KAFKA-6051
> URL: https://issues.apache.org/jira/browse/KAFKA-6051
> Project: Kafka
> Issue Type: Bug
> Reporter: Maytee Chinavanichkit
>
> The ReplicaFetcherBlockingSend works as designed and will block until it is able to get data. This becomes a problem when we are gracefully shutting down a broker. The controller will attempt to shut down the fetchers and elect new leaders. When the last fetched partition is removed as part of the {replicaManager.becomeLeaderOrFollower} call, the broker proceeds to shut down any idle ReplicaFetcherThread. The shutdown here can block until the last fetch request completes. This blocking delay is a big problem because the {replicaStateChangeLock} and the {mapLock} in {AbstractFetcherManager} are still held, causing latency spikes on multiple brokers.
> At this point the last response is no longer needed, since the fetcher is shutting down. We should close the leaderEndpoint early, during {initiateShutdown()}, instead of after {super.shutdown()}.
> For example, the log below shows the shutdown blocking the broker from processing further replica changes for ~500 ms:
> {code}
> [2017-09-01 18:11:42,879] INFO [ReplicaFetcherThread-0-2], Shutting down (kafka.server.ReplicaFetcherThread)
> [2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Stopped (kafka.server.ReplicaFetcherThread)
> [2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Shutdown completed (kafka.server.ReplicaFetcherThread)
> {code}
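For readers who do not want to open the PR, the approach described above boils down to overriding {initiateShutdown()} so the blocking send is closed before the thread is joined. Below is a minimal Scala sketch of that idea, assuming the class and member names mentioned in the description and a simplified hierarchy (the real ReplicaFetcherThread sits under AbstractFetcherThread); it is an illustration only, not the actual diff in PR 4056, which also rearranges the test.

{code}
import kafka.server.ReplicaFetcherBlockingSend // assumed package paths for that era of the code base
import kafka.utils.ShutdownableThread

// Illustrative, simplified fetcher thread: close the blocking send as soon as
// shutdown is initiated, rather than after super.shutdown() has joined the thread.
class ReplicaFetcherThread(name: String,
                           leaderEndpoint: ReplicaFetcherBlockingSend)
  extends ShutdownableThread(name) {

  override def initiateShutdown(): Boolean = {
    val justShutdown = super.initiateShutdown()
    if (justShutdown) {
      // Closing the endpoint here wakes up any fetch still blocked in the
      // network client, so shutdown does not wait for the in-flight request
      // while the fetcher-manager locks are held by the caller.
      leaderEndpoint.close()
    }
    justShutdown
  }

  override def doWork(): Unit = {
    // build and send the fetch request to the leader (omitted)
  }
}
{code}

With this ordering, the ~500 ms gap between "Shutting down" and "Shutdown completed" in the log above should shrink to roughly the time needed to interrupt and join the thread, because the caller holding {replicaStateChangeLock} and {mapLock} no longer waits on the outstanding fetch.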