[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780015#comment-17780015 ]
Matthias J. Sax commented on KAFKA-12550:
-----------------------------------------
Hmmm... Aren't the two things kinda independent? In the end, as an operator I
might still be interested in seeing whether the state-updater thread is doing
an active restore, maintaining standbys, or doing nothing.
> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
> Key: KAFKA-12550
> URL: https://issues.apache.org/jira/browse/KAFKA-12550
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
> Labels: needs-kip
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING.
> This would cover the time between the completion of a stable rebalance and
> the completion of restoration across the client. Currently, Streams reports
> the state during this time as REBALANCING, even though in most cases it
> spends much more time restoring than rebalancing.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover
> all aspects of rebalancing -> task initialization -> restoring has been a
> common source of confusion in the past. It’s also proved to be a time sink
> for us during escalations, incidents, mailing list questions, and bug
> reports. It often adds latency to escalations in particular, as we have to go
> through GTS and wait for the customer to clarify whether their “Kafka Streams
> is stuck rebalancing” ticket means that it’s literally rebalancing, or just
> in the REBALANCING state and actually stuck elsewhere in Streams.
> # Prereq for global thread improvements: for example [KIP-406:
> GlobalStreamThread should honor custom reset policy
> |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy]
> was ultimately blocked on this as we needed to pause the Streams app while
> the global thread restored from the appropriate offset. Since there’s
> absolutely no rebalancing involved in this case, piggybacking on the
> REBALANCING state would just be shooting ourselves in the foot.