[ https://issues.apache.org/jira/browse/KAFKA-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673380#comment-13673380 ]
Neha Narkhede commented on KAFKA-927:
-------------------------------------

Thanks for the revised v2 patch. A few more comments -

1. KafkaServer

1.1 startupComplete should be either a volatile variable or an AtomicBoolean. Two different threads call startup() and controlledShutdown(), and both modify startupComplete.

1.2 In controlledShutdown(), we need to handle error codes in ControlledShutdownResponse explicitly. It can happen that the error code is set and partitionsRemaining are 0, which will lead to errors.

2. Partition

From previous review #4, if the broker has to ignore the become follower request anyway, does it make sense to even process part of it and truncate the log etc.?

3. From previous review #3, I meant that it is pointless to do the ZK write on the controller. Right after the write, since the follower hasn't received the stop replica request and the leader hasn't received the shrunk isr, the broker being shut down will get added back to the ISR. You can verify that this happens from the logs. It also makes controlled shutdown very slow, since typically in production we move ~1000 partitions from the broker and zk writes can take ~20ms, which means several seconds wasted just doing the ZK writes. Instead, it is enough to let the leader shrink the isr by sending it the leader and isr request. On the other hand, we can argue that the OfflineReplica state change itself should be changed to avoid the ZK write. But that is a bigger change, so we should avoid that right now.

> Integrate controlled shutdown into kafka shutdown hook
> ------------------------------------------------------
>
>                 Key: KAFKA-927
>                 URL: https://issues.apache.org/jira/browse/KAFKA-927
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Sriram Subramanian
>            Assignee: Sriram Subramanian
>         Attachments: KAFKA-927.patch, KAFKA-927-v2.patch, KAFKA-927-v2-revised.patch
>
>
> The controlled shutdown mechanism should be integrated into the software for better operational benefits.
> Also, a few optimizations can be done to reduce unnecessary rpc and zk calls. This patch has been tested on a prod-like environment by doing rolling bounces continuously for a day. The average time of doing a rolling bounce with controlled shutdown for a cluster with 7 nodes without this patch is 340 seconds. With this patch it reduces to 220 seconds.
> Also, it ensures correctness in scenarios where the controller shrinks the isr and the new leader could place the broker to be shut down back into the isr.
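
For reference, review points 1.1 and 1.2 could look roughly like the sketch below. It is written in Java purely for illustration and is not taken from the patch: the class ControlledShutdownSketch, the Response stand-in for ControlledShutdownResponse, the NO_ERROR constant and the error code value 1 are all hypothetical. The idea is only that the flag shared between the two threads is an AtomicBoolean, and that the error code is checked before partitionsRemaining is trusted.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; every name here is hypothetical, not Kafka's actual API.
class ControlledShutdownSketch {

    // Assumed convention for this sketch: 0 means "no error".
    static final short NO_ERROR = 0;

    // Hypothetical stand-in for ControlledShutdownResponse.
    static final class Response {
        final short errorCode;
        final int partitionsRemaining;

        Response(short errorCode, int partitionsRemaining) {
            this.errorCode = errorCode;
            this.partitionsRemaining = partitionsRemaining;
        }
    }

    // Point 1.1: the flag is written and read by different threads,
    // so it is an AtomicBoolean rather than a plain boolean field.
    private final AtomicBoolean startupComplete = new AtomicBoolean(false);

    void startup() {
        startupComplete.set(true);
    }

    // Returns true only when the response indicates a clean shutdown.
    boolean controlledShutdown(Response response) {
        if (!startupComplete.get()) {
            return false; // nothing to shut down before startup has finished
        }
        // Point 1.2: look at the error code first. A response can carry an
        // error and still report zero partitions remaining.
        if (response.errorCode != NO_ERROR) {
            return false;
        }
        return response.partitionsRemaining == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        ControlledShutdownSketch server = new ControlledShutdownSketch();

        // startup() runs on a different thread from controlledShutdown().
        Thread starter = new Thread(server::startup);
        starter.start();
        starter.join();

        // No error and nothing remaining: treated as success.
        System.out.println(server.controlledShutdown(new Response(NO_ERROR, 0)));
        // Error set with zero partitions remaining: must not be treated as success.
        System.out.println(server.controlledShutdown(new Response((short) 1, 0)));
    }
}
```

A volatile boolean would give the same visibility guarantee here; AtomicBoolean only becomes necessary if a compare-and-set on the flag is needed later.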