[
https://issues.apache.org/jira/browse/KAFKA-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673182#comment-13673182
]
Jun Rao commented on KAFKA-927:
-------------------------------
Thanks for patch v2. A few more comments:
20. KafkaController: If when shutdownBroker is called, the controller is no
longer active, both state machines will throw an exception on state change
calls. However, the issue is that we add the shutdown broker to
controllerContext.shuttingDownBrokerIds and it's never reset. This may become a
problem if this broker becomes a controller again. At the minimum, we need to
reset controllerContext.shuttingDownBrokerIds in onControllerFailover().
However, I am a bit confused why we never reset
controllerContext.shuttingDownBrokerIds and the shutdown logic still works.
21. ControlledShutdownRequest.handleError(): We should probably set
partitionsRemaining in ControlledShutdownResponse to empty instead of null,
since the serialization of ControlledShutdownResponse doesn't handle
partitionsRemaining being null.
22. testRollingBounce:
22.1 The test makes sure that the leader for topic1 is changed after broker 0
is shutdown. However, the leader for topic1 could be on broker 1 initially. In
this case, the leader won't be changed after broker 0 is shutdown.
22.2 The default controlledShutdownRetryBackoffMs is 5secs, which is probably
too long for the unit test.
23. KafkaServer: We need to handle the errorCode in ControlledShutdownResponse
since the controller may have moved after we send the ControlledShutdown
request.
>From the previous review:
3. I think a simple solution is to (1) not call
replicaManager.replicaFetcherManager.closeAllFetchers() in KafkaServer during
shutdown; (2) in KafkaController.shutdownBroker(), for each partition on the
shutdown broker, we first send a stopReplicaRequest to it for that partition
before going through the state machine logic. Since the state machine logic
involves ZK reads/writes, it's very likely that the stopReplicaRequest will
reach the broker before the subsequent LeaderAndIsr requests. So, in most
cases, the leader should be able to shrink ISR quicker than the timeout,
without churns in ISR.
> Integrate controlled shutdown into kafka shutdown hook
> ------------------------------------------------------
>
> Key: KAFKA-927
> URL: https://issues.apache.org/jira/browse/KAFKA-927
> Project: Kafka
> Issue Type: Bug
> Reporter: Sriram Subramanian
> Assignee: Sriram Subramanian
> Attachments: KAFKA-927.patch, KAFKA-927-v2.patch
>
>
> The controlled shutdown mechanism should be integrated into the software for
> better operational benefits. Also few optimizations can be done to reduce
> unnecessary rpc and zk calls. This patch has been tested on a prod like
> environment by doing rolling bounces continuously for a day. The average time
> of doing a rolling bounce with controlled shutdown for a cluster with 7 nodes
> without this patch is 340 seconds. With this patch it reduces to 220 seconds.
> Also it ensures correctness in scenarios where the controller shrinks the isr
> and the new leader could place the broker to be shutdown back into the isr.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira