[ 
https://issues.apache.org/jira/browse/KAFKA-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14292.
-------------------------------------
    Fix Version/s: 3.4.0
                   3.3.2
       Resolution: Fixed

> KRaft broker controlled shutdown can be delayed indefinitely
> ------------------------------------------------------------
>
>                 Key: KAFKA-14292
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14292
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Alyssa Huang
>            Priority: Major
>             Fix For: 3.4.0, 3.3.2
>
>
> We noticed when rolling a KRaft cluster that it took an unexpectedly long 
> time for one of the brokers to shut down. In the logs, we saw the following:
> {code:java}
> Oct 11, 2022 @ 17:53:38.277   [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283357 is 
> not greater than the broker's shutdown offset 2283358. 
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:38.277   [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283362.  
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:40.278   [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283366.  
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:40.278   [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283361 is 
> not greater than the broker's shutdown offset 2283362. 
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:42.279   [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283365 is 
> not greater than the broker's shutdown offset 2283366. 
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:42.279   [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283370.  
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:44.280   [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283369 is 
> not greater than the broker's shutdown offset 2283370. 
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG   
> Oct 11, 2022 @ 17:53:44.281   [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283374.  
> org.apache.kafka.controller.BrokerHeartbeatManager      DEBUG    {code}
> From what I can tell, it looks like the controller waits until all brokers 
> have caught up to the {{controlledShutdownOffset}} of the broker that is 
> shutting down before allowing it to proceed. Probably the intent is to make 
> sure they have all the leader and ISR state.
> The problem is that the {{controlledShutdownOffset}} seems to be updated 
> after every heartbeat that the controller receives: 
> https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996.
> Unless all other brokers can catch up to that offset before the next 
> heartbeat from the shutting-down broker is received, the broker remains 
> in the shutting-down state indefinitely.
> In this case, it took more than 40 minutes before the broker completed 
> shutdown:
> {code:java}
> Oct 11, 2022 @ 18:36:36.105   [Controller 1] The request from broker 8 to 
> shut down has been granted since the lowest active offset 2288510 is now 
> greater than the broker's controlled shutdown offset 2288510.      
> org.apache.kafka.controller.BrokerHeartbeatManager      INFO    
> Oct 11, 2022 @ 18:40:35.197   [Controller 1] The request from broker 8 to 
> unfence has been granted because it has caught up with the offset of it's 
> register broker record 2288906.   
> org.apache.kafka.controller.BrokerHeartbeatManager      INFO{code}
> It seems like the bug here is that we should not keep updating 
> {{controlledShutdownOffset}} if it has already been set.
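
The behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the actual QuorumController or BrokerHeartbeatManager code: the class ControlledShutdownSketch, the method heartbeatsUntilGranted, and all of its parameters are hypothetical. The model assumes the metadata log grows by a fixed number of records between heartbeats and that the slowest active broker trails the log end by a fixed lag. The grant condition (lowest active offset >= shutdown offset) is inferred from the log lines, where the request was granted once the two offsets were equal.

```java
final class ControlledShutdownSketch {

    /**
     * Simulates heartbeats from a broker that wants to shut down.
     * Returns the 1-based heartbeat on which shutdown is granted,
     * or -1 if it is not granted within maxHeartbeats.
     *
     * All names and parameters are hypothetical and exist only for illustration.
     */
    static int heartbeatsUntilGranted(boolean resetOnEveryHeartbeat,
                                      long recordsPerHeartbeat,
                                      long followerLag,
                                      int maxHeartbeats) {
        long logEndOffset = 1000L;   // arbitrary starting offset of the metadata log
        long shutdownOffset = -1L;   // -1 means "not yet set"

        for (int heartbeat = 1; heartbeat <= maxHeartbeats; heartbeat++) {
            // Assumption: the slowest active broker trails the log end by a fixed lag.
            long lowestActiveOffset = logEndOffset - followerLag;

            // Grant once every active broker has reached the shutdown offset.
            if (shutdownOffset >= 0 && lowestActiveOffset >= shutdownOffset) {
                return heartbeat;
            }

            // Reported behavior: the offset is moved forward on every heartbeat.
            // Suggested behavior: set it once and leave it alone afterwards.
            if (shutdownOffset < 0 || resetOnEveryHeartbeat) {
                shutdownOffset = logEndOffset;
            }

            // New records are appended before the next heartbeat arrives.
            logEndOffset += recordsPerHeartbeat;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 4 new records per heartbeat, slowest broker 5 records behind.
        System.out.println("reset on every heartbeat: "
                + heartbeatsUntilGranted(true, 4, 5, 1000));
        System.out.println("set once:                 "
                + heartbeatsUntilGranted(false, 4, 5, 1000));
    }
}
```

With these illustrative numbers, the variant that resets the offset on every heartbeat never grants the request within the cap, because the target keeps moving one record ahead of the slowest broker, much like the DEBUG lines above. The variant that sets the offset once grants the request on the third heartbeat.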



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
