[ 
https://issues.apache.org/jira/browse/KAFKA-17789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059848#comment-18059848
 ] 

Guang Zhao commented on KAFKA-17789:
------------------------------------

I'm trying to reproduce the problem locally -- could I ask for more context 
about the original setup, please:
- did the non-restoring clients eventually start restoring on their own (given 
enough time), or did they remain permanently stuck with zero restoration progress?
- similarly, when the stuck clients were asked to stop, did they eventually shut 
down (after a while, say several minutes), or did they hang indefinitely and 
require a force-kill?

My local setup on trunk shows both the split between restoring and non-restoring 
clients and the shutdown hang, but both are bounded by the admin client and 
state updater timeouts (the shutdown can take up to several minutes, but 
eventually completes). So I wonder whether the original setup observed the same 
bounded behavior, or a truly permanent stuck state.
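
For reference, this is roughly the kind of minimal stateful app I've been using 
locally to force changelog restoration -- the application id, topic name, state 
dir, and thread count below are placeholders from my own setup, not details 
taken from the original report:

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.Properties;

public class RestoreRepro {
    public static void main(String[] args) {
        // Placeholder config for a local reproduction; not the reporter's actual settings.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "restore-repro");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/restore-repro-state"); // wiped between runs
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4); // multiple threads per client

        StreamsBuilder builder = new StreamsBuilder();
        // A keyed count is backed by a changelog topic, so starting with an empty
        // state dir forces the state updater to restore from the changelog.
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
{code}

I run several instances of this, let them catch up, stop them, wipe the state 
dir, and restart, matching the reproduction steps quoted below.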

[~chuckame] - if you have more context to share, that'd be appreciated!  

> State updater stuck when starting with empty state folder
> ---------------------------------------------------------
>
>                 Key: KAFKA-17789
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17789
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.0
>            Reporter: Antoine Michaud
>            Priority: Critical
>
> In an application with multiple clients, each having multiple threads, when 
> the app is started with empty storage (without resetting the whole 
> application), only some of the clients restore the changelog topics.
> The non-restoring clients are also unable to shut down gracefully.
>  
> Reproduction steps
> > I'm including all the actual details here; I'm also going to create a 
> > project to reproduce it locally, and I'll link it in this ticket.
>  * Run the app in a Kubernetes environment with multiple pods (5), so there 
> are 5 streams clients, and with enough data (or little enough CPU) that 
> restoration takes long enough to observe the issue after 1 or 2 minutes
>  * Have the input topics already consumed and the app live (no lag on input 
> or internal topics)
>  * then stop the app
>  * clear out the local storage
>  * finally restart and see that only 2 or 3 clients are restoring, while the 
> others consume nothing
>  * Bonus: stop the clients; the stuck clients should not close and should 
> keep sending heartbeats and responding to any rebalance assignment
> Related slack discussion: 
> https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1728296887560369



