[ 
https://issues.apache.org/jira/browse/KAFKA-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210968#comment-17210968
 ] 

Bruno Cadonna commented on KAFKA-10555:
---------------------------------------

[~ableegoldman] Thank you for your detailed analysis!

First of all, I agree that transitioning to the ERROR state should not leave 
the Kafka Streams client in a zombie state. The client should be closed and all 
its threads should be shutdown.

I think, operators can achieve safety first also without an ERROR state. They 
could set the appropriate signals (e.g. do not restart this client) from the 
uncaught exception handler without relying on the state of the Kafka Streams 
client. I can imagine that taking the appropriate measures in the uncaught 
exception handler is even more flexible for the operators, because they can 
directly call existing APIs of their cluster environment. 

If we still decide to stick to the ERROR state, I agree with you on cases #1 
and #2. 

For case #3, the current proposal in KIP-663 is to stay in RUNNING. I would not 
go to NOT_RUNNING if we define NOT_RUNNING to be the state the client ends up 
after it is closed and all its threads -- including cleanup thread etc. -- are 
shutdown as implied by case #1. Maybe we need another state NOT_PROCESSING 
([~wcarlson5]'s idea). In that case I would propose to call the state 
NOT_POLLING to be more exact, because you could be not processing records also 
with 20 alive stream threads that continuously poll records.

Case #4 is a sign to me that we should probably get rid of the ERROR state and 
users should take care of this case in the uncaught exception handler.

Case #5 is similar to case #3. Should we even merge this two cases? 

For case #6, we first need to exactly define NOT_RUNNING. If it is the state 
the client ends up after calling close(), we cannot start a new stream thread 
anymore.
  
I think, we need to define the states in more detail before we discuss to which 
state we transit. I propose the following states:
CREATED: The Kafka Streams client is created but not started
REBALANCING: The stream threads of the Kafka Streams client are rebalancing
RUNNING: At least one stream thread is alive or only the global stream thread 
is alive and polls records.
NOT_POLLING: The client does not poll because no stream threads and no global 
thread is alive. A stream thread can be added. (Should we stop -- not shutdown 
-- the cleanup thread in this state and resume it when the client transits to 
RUNNING?)   
PENDING_SHUTDOWN: The Kafka Streams client is closing
NOT_RUNNING (terminal state): The Kafka Streams client is closed, i.e., all 
stream threads are shutdown and all maintenance threads (e.g. cleanup thread) 
are also shutdown.
(ERROR (terminal state): Same as NOT_RUNNING, but it also signals "restart at 
your own risk")




> Improve client state machine
> ----------------------------
>
>                 Key: KAFKA-10555
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10555
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Major
>              Labels: needs-kip
>
> The KafkaStreams client exposes its state to the user for monitoring purpose 
> (ie, RUNNING, REBALANCING etc). The state of the client depends on the 
> state(s) of the internal StreamThreads that have their own states.
> Furthermore, the client state has impact on what the user can do with the 
> client. For example, active task can only be queried in RUNNING state and 
> similar.
> With KIP-671 and KIP-663 we improved error handling capabilities and allow to 
> add/remove stream thread dynamically. We allow adding/removing threads only 
> in RUNNING and REBALANCING state. This puts us in a "weird" position, because 
> if we enter ERROR state (ie, if the last thread dies), we cannot add new 
> threads and longer. However, if we have multiple threads and one dies, we 
> don't enter ERROR state and do allow to recover the thread.
> Before the KIPs the definition of ERROR state was clear, however, with both 
> KIPs it seem that we should revisit its semantics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to