[ https://issues.apache.org/jira/browse/KAFKA-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952316#comment-17952316 ]
Colt McNealy commented on KAFKA-13333:
--------------------------------------

> If the task wasn't caught up on this host but assigned to it anyway, that indicates there wasn't any other host with enough state for this task and therefore no one to temporarily take it over

Hmm. Is that still true? What if a standby task has made significant progress towards rehydrating the state during the time the active task was restoring, before the corruption occurred?

This might be a moot point with KIP-1071 though (:

> Optimize condition for triggering rebalance after wiping out corrupted task
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-13333
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13333
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>              Labels: new-streams-runtime-should-fix
>
> Just filing a ticket to list some thoughts I had on optimizing https://issues.apache.org/jira/browse/KAFKA-12486.
>
> The idea here is to trigger a rebalance upon detecting corruption of some task. This task may have had a large amount of state that had to be wiped out under eos, so we might be able to avoid a long downtime due to restoration if we can utilize the HA TaskAssignor to temporarily move that active task to another node that already has some state for it (e.g. one that had a standby task for it).
>
> Right now, we trigger that rebalance under the condition that (a) eos is enabled, and (b) at least one of the corrupted tasks was an active task. This is a pretty safe bet, but it's worth jotting down some potential optimizations of this condition so we can trim down the occurrences of unnecessary rebalances that wouldn't have helped. For example:
>
> 1) Don't kick off a rebalance if the corrupted task is in CREATED or RESTORING and is not within the acceptable.recovery.lag from the end of the changelog. If the task wasn't caught up on this host but was assigned to it anyway, that indicates there wasn't any other host with enough state for this task and therefore no one to temporarily take it over.
>
> 2) Only trigger a rebalance if standbys are configured, and/or parse the standby host info to verify whether this task has a standby copy on another live client. It's still possible to have a copy of this task's state on another host even without standbys, but the odds are greatly reduced.
>
> 3) If we want to get really fancy (and I'm not quite sure we do), we could have the assignor report not just the names but also the lags of the standby tasks on other hosts, and then trigger the rebalance depending on whether this task has a hot standby within the acceptable.recovery.lag.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
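The three proposed optimizations can be sketched as a single predicate. This is purely an illustration of the logic in the description: `RebalanceCondition`, `TaskSnapshot`, `shouldTriggerRebalance`, and all field names are invented for this sketch and are not the actual Kafka Streams internals.

```java
import java.util.List;

public class RebalanceCondition {

    enum TaskState { CREATED, RESTORING, RUNNING }

    // Hypothetical stand-in for the information the assignor could report
    // about a corrupted task; not a real Kafka Streams type.
    record TaskSnapshot(TaskState state,
                        boolean active,
                        long localLag,            // this host's lag behind the changelog end
                        List<Long> standbyLags) { // lags of standby copies on other live clients
    }

    static boolean shouldTriggerRebalance(boolean eosEnabled,
                                          TaskSnapshot task,
                                          long acceptableRecoveryLag) {
        // Current condition: eos is enabled and the corrupted task was active.
        if (!eosEnabled || !task.active()) {
            return false;
        }
        // Optimization 1: a task still in CREATED/RESTORING and far from the
        // changelog end was assigned here despite not being caught up, which
        // suggests no other host had usable state either -- skip the rebalance.
        boolean notCaughtUpHere = task.state() != TaskState.RUNNING
                && task.localLag() > acceptableRecoveryLag;
        if (notCaughtUpHere) {
            return false;
        }
        // Optimizations 2 and 3: only rebalance if some other live client
        // holds a standby copy within acceptable.recovery.lag (a hot standby).
        return task.standbyLags().stream()
                .anyMatch(lag -> lag <= acceptableRecoveryLag);
    }
}
```

Under these assumptions, the rebalance fires only when moving the task could actually shorten downtime: the task was active under eos, it was reasonably caught up here, and a hot standby exists elsewhere.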