[jira] [Commented] (KAFKA-10559) Don't shutdown the entire app upon TimeoutException during internal topic validation

Sagar Rao (Jira) Wed, 07 Oct 2020 09:07:16 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209647#comment-17209647
 ]


Sagar Rao commented on KAFKA-10559:
-----------------------------------

[~ableegoldman], I looked at the code today. I see that the 
INCOMPLETE_SOURCE_TOPIC_METADATA error code is being sent from 2 places:

[https://github.com/confluentinc/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsPartitionAssignor.java#L379-L383]

and 

[https://github.com/confluentinc/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsPartitionAssignor.java#L345-L351]

 

In both places, the error code is sent for both TaskAssignmentException and 
TimeoutException. As per your suggestion, for the TimeoutException case, do I 
just need to rethrow the TimeoutException, and is that all I need to do?
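
To make the question concrete, here is a rough sketch of the change being asked about. It is not the actual StreamsPartitionAssignor code: the class name, the `validate` helper, and the local exception and enum types are hypothetical stand-ins so that the snippet compiles without Kafka on the classpath. It also assumes the TaskAssignmentException path keeps sending the error code and only the TimeoutException path changes.

```java
final class TimeoutHandlingSketch {
    // Hypothetical stand-ins for the Kafka exception types, declared locally
    // so the sketch is self-contained.
    static class TimeoutException extends RuntimeException {
        TimeoutException(String message) { super(message); }
    }

    static class TaskAssignmentException extends RuntimeException {
        TaskAssignmentException(String message) { super(message); }
    }

    // Hypothetical stand-in for the assignor's error codes.
    enum AssignorError { NONE, INCOMPLETE_SOURCE_TOPIC_METADATA }

    // Runs the (hypothetical) internal topic validation step and decides
    // how each failure is surfaced.
    static AssignorError validate(Runnable internalTopicValidation) {
        try {
            internalTopicValidation.run();
            return AssignorError.NONE;
        } catch (TaskAssignmentException e) {
            // Assumed unchanged: still reported through the error code.
            return AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA;
        } catch (TimeoutException e) {
            // Option 1 from the ticket: rethrow instead of sending the
            // error code that shuts down the whole application.
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(validate(() -> { }));
        System.out.println(validate(() -> {
            throw new TaskAssignmentException("missing source topic");
        }));
        try {
            validate(() -> { throw new TimeoutException("topic discovery timed out"); });
        } catch (TimeoutException e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

With this shape, a timeout propagates to the caller and would kill only the thread that hit it, while the other failure still goes through the error code.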

> Don't shutdown the entire app upon TimeoutException during internal topic 
> validation
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10559
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10559
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Blocker
>             Fix For: 2.7.0
>
>
> During some of the KIP-572 work, we made things pretty brittle by changing 
> the StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA` 
> error code and shut down the entire application if a TimeoutException is hit 
> during the internal topic creation/validation.
> Internal topic validation occurs during every rebalance, and we have seen it 
> time out on topic discovery in unstable environments. So shutting down the 
> entire application seems like a step in the wrong direction, and antithetical 
> to the goal of KIP-572 (improving the resiliency of Streams in the face of 
> TimeoutExceptions).
> I'm not totally sure what the previous behavior was, but it seems to me we 
> have three options:
>  # Rethrow the TimeoutException and allow it to kill the thread
>  # Swallow the TimeoutException and retry the rebalance indefinitely
>  # Some combination of the above: swallow the TimeoutException but don't 
> retry indefinitely:
>  ## Start a timer and allow retrying rebalances for up to the configured 
> task.timeout.ms, the timeout config introduced in KIP-572
>  ## Retry for some constant number of rebalances
> I think if we go with option 3, then shutting down the entire application is 
> relatively more palatable, as we have given the environment a chance to 
> stabilize.
> But, killing the thread still seems preferable, given the two new features 
> that are coming out soon: the ability to start up new threads, and the 
> improved exception handler that allows the user to choose to shut down the 
> entire application if that's really what they want. Once users have this 
> level of control over the application, we should allow them to decide how 
> they want to handle exceptional cases like this, rather than forcing an 
> option on them (e.g. shutting down everything).
>  
> Imo we should fix this before 2.7 comes out, even if it's just a partial fix 
> (eg we do option 1 in 2.7, but plan to implement option 3 eventually)

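For the timer variant of option 3 in the description above, the bookkeeping could be as small as the sketch below. Everything in it is hypothetical (the class name, the method names, passing the current time in as a parameter); it only illustrates "swallow the TimeoutException and retry until task.timeout.ms has elapsed since the first timeout" and is not taken from the Streams code base.

```java
final class RebalanceRetrySketch {
    private final long taskTimeoutMs;   // would come from task.timeout.ms
    private long firstTimeoutMs = -1L;  // -1 means no timer is running

    RebalanceRetrySketch(long taskTimeoutMs) {
        this.taskTimeoutMs = taskTimeoutMs;
    }

    // Call when internal topic validation times out. Returns true while the
    // timeout may be swallowed and the rebalance retried, false once more
    // than taskTimeoutMs has elapsed since the first timeout.
    boolean shouldRetry(long nowMs) {
        if (firstTimeoutMs < 0) {
            firstTimeoutMs = nowMs; // start the timer on the first timeout
        }
        return nowMs - firstTimeoutMs <= taskTimeoutMs;
    }

    // Call when validation succeeds, so a later timeout starts a fresh timer.
    void onSuccess() {
        firstTimeoutMs = -1L;
    }

    public static void main(String[] args) {
        RebalanceRetrySketch retry = new RebalanceRetrySketch(1_000L);
        long[] timeoutsAtMs = {0L, 400L, 1_000L, 1_001L};
        for (long now : timeoutsAtMs) {
            System.out.println("t=" + now + "ms retry=" + retry.shouldRetry(now));
        }
    }
}
```

Once `shouldRetry` returns false, the caller would fall back to whichever terminal behavior is chosen (rethrowing, or sending the error code). The constant-count variant would replace the timestamp with a counter.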


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
