[ https://issues.apache.org/jira/browse/KAFKA-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952316#comment-17952316 ]
Colt McNealy commented on KAFKA-13333:
--------------------------------------

> If the task wasn't caught up on this host but assigned to it anyway, that indicates there wasn't any other host with enough state for this task and therefore no one to temporarily take it over

Hmm. Is that still true? What if a standby task has made significant progress towards rehydrating the state during the time the active task was restoring, before the corruption occurred?

This might be a moot point with KIP-1071 though (:

> Optimize condition for triggering rebalance after wiping out corrupted task
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-13333
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13333
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>              Labels: new-streams-runtime-should-fix
>
> Just filing a ticket to list some thoughts I had on optimizing https://issues.apache.org/jira/browse/KAFKA-12486.
>
> The idea here is to trigger a rebalance upon detecting corruption of some task. This task may have had a large amount of state that had to be wiped out under eos, so we might be able to avoid a long downtime due to restoration if we can utilize the HA TaskAssignor to temporarily move that active task to another node that already has some state for it (e.g. one that had a standby task for it).
>
> Right now, we trigger that rebalance under the condition that (a) eos is enabled, and (b) at least one of the corrupted tasks was an active task. This is a pretty safe bet, but it's worth jotting down some potential optimizations of this condition so we can trim down the occurrences of unnecessary rebalances that wouldn't have helped. For example:
>
> 1) Don't kick off a rebalance if the corrupted task is in CREATED or RESTORING and is not within the acceptable.recovery.lag from the end of the changelog. If the task wasn't caught up on this host but was assigned to it anyway, that indicates there wasn't any other host with enough state for this task and therefore no one to temporarily take it over.
>
> 2) Only trigger a rebalance if standbys are configured, and/or parse the standby host info to verify whether this task has a standby copy on another live client. It's still possible to have a copy of this task's state on another host even without standbys, but the odds are greatly reduced.
>
> 3) If we want to get really fancy (and I'm not quite sure we do), we could have the assignor report not just the names but also the lags of the standby tasks on other hosts, and then trigger the rebalance depending on whether this task has a hot standby within the acceptable.recovery.lag.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
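The three proposed optimizations can be sketched as a single predicate. This is purely an illustration of the logic in the description: `RebalanceCondition`, `TaskSnapshot`, `shouldTriggerRebalance`, and all field names are invented for this sketch and are not the actual Kafka Streams internals.

```java
import java.util.List;

public class RebalanceCondition {

    enum TaskState { CREATED, RESTORING, RUNNING }

    // Hypothetical stand-in for the information the assignor could report
    // about a corrupted task; not a real Kafka Streams type.
    record TaskSnapshot(TaskState state,
                        boolean active,
                        long localLag,            // this host's lag behind the changelog end
                        List<Long> standbyLags) { // lags of standby copies on other live clients
    }

    static boolean shouldTriggerRebalance(boolean eosEnabled,
                                          TaskSnapshot task,
                                          long acceptableRecoveryLag) {
        // Current condition: eos is enabled and the corrupted task was active.
        if (!eosEnabled || !task.active()) {
            return false;
        }
        // Optimization 1: a task still in CREATED/RESTORING and far from the
        // changelog end was assigned here despite not being caught up, which
        // suggests no other host had usable state either -- skip the rebalance.
        boolean notCaughtUpHere = task.state() != TaskState.RUNNING
                && task.localLag() > acceptableRecoveryLag;
        if (notCaughtUpHere) {
            return false;
        }
        // Optimizations 2 and 3: only rebalance if some other live client
        // holds a standby copy within acceptable.recovery.lag (a hot standby).
        return task.standbyLags().stream()
                .anyMatch(lag -> lag <= acceptableRecoveryLag);
    }
}
```

Under these assumptions, the rebalance fires only when moving the task could actually shorten downtime: the task was active under eos, it was reasonably caught up here, and a hot standby exists elsewhere.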