[ 
https://issues.apache.org/jira/browse/KAFKA-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671608#comment-13671608
 ] 

Jun Rao commented on KAFKA-927:
-------------------------------

Thanks for the patch. Overall, a well thought-out patch. Some comments.

1. KafkaController.shutdownBroker: We should probably only do controlled 
shutdown if the controller is active.

2. KafkaApis: If the controller is not active, we should send an errorcode back 
to the ControlledShutdownRequest.

3. KafkaServer: I am not sure that we should call 
replicaManager.replicaFetcherManager.closeAllFetchers() at the beginning of the 
controlled shutdown. Once the fetchers are closed, the affected leaders have to 
wait for the timeout before committing new messages since they have to shrink 
the ISR. Instead, it's better if we let the fetcher to be closed through 
leaderAndIsr requests from the controller.

4. KafkaConfig: Could we consolidate controlledShutdownMaxRetries and 
controlledShutdownEnable to one config controlledShutDownWaitTime? If that 
value is <=0, no controlled shutdown is done. Otherwise, we will try controlled 
shutdown until that time has passed.

5. With the new logic added in ReplicaManager/Partition, I am not sure if the 
old controlled shutdown tool still works properly. Should we just remove the 
tool and the jmx hook?

6. There are new files not included in the patch.
                
> Integrate controlled shutdown into kafka shutdown hook
> ------------------------------------------------------
>
>                 Key: KAFKA-927
>                 URL: https://issues.apache.org/jira/browse/KAFKA-927
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Sriram Subramanian
>            Assignee: Sriram Subramanian
>         Attachments: KAFKA-927.patch
>
>
> The controlled shutdown mechanism should be integrated into the software for 
> better operational benefits. Also few optimizations can be done to reduce 
> unnecessary rpc and zk calls. This patch has been tested on a prod like 
> environment by doing rolling bounces continuously for a day. The average time 
> of doing a rolling bounce with controlled shutdown for a cluster with 7 nodes 
> without this patch is 340 seconds. With this patch it reduces to 220 seconds. 
> Also it ensures correctness in scenarios where the controller shrinks the isr 
> and the new leader could place the broker to be shutdown back into the isr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to