[ https://issues.apache.org/jira/browse/KAFKA-10555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210968#comment-17210968 ]
Bruno Cadonna commented on KAFKA-10555: --------------------------------------- [~ableegoldman] Thank you for your detailed analysis! First of all, I agree that transitioning to the ERROR state should not leave the Kafka Streams client in a zombie state. The client should be closed and all its threads should be shutdown. I think, operators can achieve safety first also without an ERROR state. They could set the appropriate signals (e.g. do not restart this client) from the uncaught exception handler without relying on the state of the Kafka Streams client. I can imagine that taking the appropriate measures in the uncaught exception handler is even more flexible for the operators, because they can directly call existing APIs of their cluster environment. If we still decide to stick to the ERROR state, I agree with you on cases #1 and #2. For case #3, the current proposal in KIP-663 is to stay in RUNNING. I would not go to NOT_RUNNING if we define NOT_RUNNING to be the state the client ends up after it is closed and all its threads -- including cleanup thread etc. -- are shutdown as implied by case #1. Maybe we need another state NOT_PROCESSING ([~wcarlson5]'s idea). In that case I would propose to call the state NOT_POLLING to be more exact, because you could be not processing records also with 20 alive stream threads that continuously poll records. Case #4 is a sign to me that we should probably get rid of the ERROR state and users should take care of this case in the uncaught exception handler. Case #5 is similar to case #3. Should we even merge this two cases? For case #6, we first need to exactly define NOT_RUNNING. If it is the state the client ends up after calling close(), we cannot start a new stream thread anymore. I think, we need to define the states in more detail before we discuss to which state we transit. I propose the following states: CREATED: The Kafka Streams client is created but not started REBALANCING: The stream threads of the Kafka Streams client are rebalancing RUNNING: At least one stream thread is alive or only the global stream thread is alive and polls records. NOT_POLLING: The client does not poll because no stream threads and no global thread is alive. A stream thread can be added. (Should we stop -- not shutdown -- the cleanup thread in this state and resume it when the client transits to RUNNING?) PENDING_SHUTDOWN: The Kafka Streams client is closing NOT_RUNNING (terminal state): The Kafka Streams client is closed, i.e., all stream threads are shutdown and all maintenance threads (e.g. cleanup thread) are also shutdown. (ERROR (terminal state): Same as NOT_RUNNING, but it also signals "restart at your own risk") > Improve client state machine > ---------------------------- > > Key: KAFKA-10555 > URL: https://issues.apache.org/jira/browse/KAFKA-10555 > Project: Kafka > Issue Type: Improvement > Components: streams > Reporter: Matthias J. Sax > Priority: Major > Labels: needs-kip > > The KafkaStreams client exposes its state to the user for monitoring purpose > (ie, RUNNING, REBALANCING etc). The state of the client depends on the > state(s) of the internal StreamThreads that have their own states. > Furthermore, the client state has impact on what the user can do with the > client. For example, active task can only be queried in RUNNING state and > similar. > With KIP-671 and KIP-663 we improved error handling capabilities and allow to > add/remove stream thread dynamically. We allow adding/removing threads only > in RUNNING and REBALANCING state. This puts us in a "weird" position, because > if we enter ERROR state (ie, if the last thread dies), we cannot add new > threads and longer. However, if we have multiple threads and one dies, we > don't enter ERROR state and do allow to recover the thread. > Before the KIPs the definition of ERROR state was clear, however, with both > KIPs it seem that we should revisit its semantics. -- This message was sent by Atlassian Jira (v8.3.4#803005)