[ https://issues.apache.org/jira/browse/KAFKA-13501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867880#comment-17867880 ]
Matthias J. Sax commented on KAFKA-13501:
-----------------------------------------

I don't think the state updater solves this? Also not sure why it's labeled with "new-streams-runtime-should-fix"; I don't see how.

If a task fails locally and we restart the task locally, we rebuild state from scratch. With the state updater the same happens. The difference with the state updater is "only" (it can be significant) that we would no longer block all tasks from processing, but keep processing all other tasks while the state updater does the restore. However, for the failed task, we still have offline time.

The idea of this ticket was to say: if we have two instances A and B, the local failure happens on A, and B has a standby, let's trigger a rebalance and move the failed task to B to avoid offline time for the failed task altogether. On instance A, we might still rebuild the state using the state updater, but B would take over processing in the meantime. And after A is done restoring, we could do another rebalance to move the active back from B to A (and still keep a standby on B).

Does this make sense? (Maybe the ticket description was too brief?)

> Avoid state restore via rebalance if standbys are enabled
> ---------------------------------------------------------
>
>                 Key: KAFKA-13501
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13501
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Major
>              Labels: new-streams-runtime-should-fix
>
> There are certain scenarios in which Kafka Streams wipes out local state and
> rebuilds it from scratch. This is a thread-local cleanup, i.e., no rebalance is
> triggered, and we end up with an offline task until state restoration
> finishes.
> If standby tasks are enabled, it might actually make sense to trigger a
> rebalance instead, to get the task re-assigned to the instance hosting the
> standby so the task becomes active again quickly.
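
To make the A/B failover flow from the comment concrete, here is a minimal sketch in plain Java. It is a toy model only: `StandbyFailoverSketch`, `TaskPlacement`, and their methods are hypothetical names invented for illustration and are not Kafka Streams APIs, and the sketch assumes a single task with at most one standby. It only tracks which instance is active, which holds the standby, and which is restoring.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposal; none of these names are Kafka Streams APIs. */
public class StandbyFailoverSketch {

    /** Hypothetical placement of a single task across instances. */
    static final class TaskPlacement {
        String active;            // instance currently assigned as active
        String standby;           // instance holding a warm copy of the state, or null
        String restoring = null;  // instance rebuilding wiped state, or null
        final List<String> log = new ArrayList<>();

        TaskPlacement(String active, String standby) {
            this.active = active;
            this.standby = standby;
        }

        /** Local failure on the active instance: its local state is wiped. */
        void onLocalFailure() {
            String failed = active;
            restoring = failed;
            if (standby != null) {
                // Proposed behavior: rebalance so the standby host takes over.
                active = standby;
                standby = null;
                log.add("rebalance: active moves to " + active
                        + ", " + failed + " restores in the background");
            } else {
                // Behavior without a standby: the task waits for the local restore.
                log.add("no standby: task offline on " + failed + " until restored");
            }
        }

        /** The failed instance has finished rebuilding its state. */
        void onRestoreComplete() {
            if (restoring == null) {
                return; // nothing was restoring
            }
            if (!restoring.equals(active)) {
                // Second rebalance: move active back, keep the helper as standby.
                standby = active;
                active = restoring;
                log.add("rebalance: active moves back to " + active
                        + ", standby stays on " + standby);
            } else {
                log.add("task back online on " + active);
            }
            restoring = null;
        }

        /** The task is offline while its active instance is still restoring. */
        boolean isOffline() {
            return restoring != null && restoring.equals(active);
        }
    }

    public static void main(String[] args) {
        TaskPlacement task = new TaskPlacement("A", "B");
        task.onLocalFailure();
        task.onRestoreComplete();
        task.log.forEach(System.out::println);
    }
}
```

In this model the task is never offline when a standby exists, at the cost of two rebalances; without a standby it is offline for the whole restore, which matches the behavior the ticket wants to avoid.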