[ https://issues.apache.org/jira/browse/KAFKA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209647#comment-17209647 ]
Sagar Rao commented on KAFKA-10559:
-----------------------------------

[~ableegoldman], I looked at the code today. I see that the INCOMPLETE_SOURCE_TOPIC_METADATA error code is being thrown in 2 places:

[https://github.com/confluentinc/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsPartitionAssignor.java#L379-L383]

and

[https://github.com/confluentinc/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsPartitionAssignor.java#L345-L351]

In both places, it is thrown for both TaskAssignmentException and TimeoutException.

As per your suggestion, for the TimeoutException case, is rethrowing the TimeoutException all I need to do?


> Don't shutdown the entire app upon TimeoutException during internal topic validation
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10559
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10559
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Blocker
>             Fix For: 2.7.0
>
>
> During some of the KIP-572 work, we made things pretty brittle by changing
> the StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA`
> error code and shut down the entire application if a TimeoutException is hit
> during the internal topic creation/validation.
> Internal topic validation occurs during every rebalance, and we have seen it
> time out on topic discovery in unstable environments. So shutting down the
> entire application seems like a step in the wrong direction, and antithetical
> to the goal of KIP-572 (improving the resiliency of Streams in the face of
> TimeoutExceptions).
> I'm not totally sure what the previous behavior was, but it seems to me we
> have three options:
> # Rethrow the TimeoutException and allow it to kill the thread
> # Swallow the TimeoutException and retry the rebalance indefinitely
> # Some combination of the above: swallow the TimeoutException but don't
> retry indefinitely:
> ## Start a timer and allow retrying rebalances for up to the configured
> task.timeout.ms, the timeout config introduced in KIP-572
> ## Retry for some constant number of rebalances
> I think if we go with option 3, then shutting down the entire application is
> relatively more palatable, as we have given the environment a chance to
> stabilize.
> But killing the thread still seems preferable, given the two new features
> that are coming out soon: the ability to start up new threads, and the
> improved exception handler that allows the user to choose to shut down the
> entire application if that's really what they want. Once users have this
> level of control over the application, we should allow them to decide how
> they want to handle exceptional cases like this, rather than forcing an
> option on them (e.g. shut down everything).
>
> Imo we should fix this before 2.7 comes out, even if it's just a partial fix
> (e.g. we do option 1 in 2.7, but plan to implement option 3 eventually)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
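For reference, the timer variant of option 3 in the description (swallow the TimeoutException, but only for up to task.timeout.ms) could be sketched roughly as below. This is only an illustration under assumptions, not the actual StreamsPartitionAssignor code: ValidationRetryBudget and ValidationTimeoutException are hypothetical names, the exception is a local stand-in for Kafka's TimeoutException so the snippet compiles without the Kafka jars, and the clock is injected purely to make the behaviour easy to check.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for Kafka's TimeoutException (not the real class). */
class ValidationTimeoutException extends RuntimeException {
    ValidationTimeoutException(String message) {
        super(message);
    }
}

/**
 * Hypothetical helper: tracks how long internal topic validation has been
 * timing out across consecutive rebalances.
 */
final class ValidationRetryBudget {
    private final long taskTimeoutMs;   // assumed to come from task.timeout.ms
    private final LongSupplier clock;   // returns the current time in milliseconds
    private boolean timerRunning = false;
    private long firstTimeoutMs = 0L;

    ValidationRetryBudget(long taskTimeoutMs, LongSupplier clock) {
        this.taskTimeoutMs = taskTimeoutMs;
        this.clock = clock;
    }

    /** Call after a successful validation; clears the timer. */
    void onSuccess() {
        timerRunning = false;
    }

    /**
     * Call when validation hits a timeout. Returns normally while the budget
     * lasts (the caller retries on the next rebalance) and rethrows the
     * exception once task.timeout.ms has elapsed since the first timeout.
     */
    void onTimeout(ValidationTimeoutException e) {
        long now = clock.getAsLong();
        if (!timerRunning) {
            timerRunning = true;
            firstTimeoutMs = now;
        }
        if (now - firstTimeoutMs >= taskTimeoutMs) {
            throw e;
        }
    }
}
```

With a budget of 0 ms the helper rethrows on the first timeout, which matches option 1. Any larger value gives the environment that long to stabilize before the exception is allowed to propagate.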