A. Sophie Blee-Goldman created KAFKA-13333:
----------------------------------------------
Summary: Optimize condition for triggering rebalance after wiping
out corrupted task
Key: KAFKA-13333
URL: https://issues.apache.org/jira/browse/KAFKA-13333
Project: Kafka
Issue Type: Improvement
Components: streams
Reporter: A. Sophie Blee-Goldman
Just filing a ticket to list some thoughts I had on optimizing
https://issues.apache.org/jira/browse/KAFKA-12486.
The idea in that ticket is to trigger a rebalance upon detecting corruption of
some task. This task may have had a large amount of state that had to be wiped
out under eos, so we might be able to avoid a long downtime due to restoration
if we can utilize the HA TaskAssignor to temporarily move that active task to
another node that already has some state for it (e.g. one that hosted a standby
task for it).
Right now, we trigger that rebalance under the condition that (a) eos is
enabled, and (b) at least one of the corrupted tasks was an active task. This
is a pretty safe bet, but it's worth jotting down some potential optimizations
of this condition so we can trim down the occurrences of unnecessary rebalances
that wouldn't have helped. For example:
1) Don't kick off a rebalance if the corrupted task is in the CREATED or
RESTORING state and is not within the acceptable.recovery.lag from the end of
the changelog. If the task wasn't caught up on this host but was assigned to it
anyway, that indicates there wasn't any other host with enough state for this
task, and therefore no one to temporarily take it over.
2) Only trigger a rebalance if standbys are configured, and/or parse the
standby host info to verify whether this task has a standby copy on another
live client. It's still possible to have a copy of this task's state on another
host even without standbys, but the odds are greatly reduced.
3) If we want to get really fancy (and I'm not quite sure we do), we could have
the assignor report not just the names but also the lag of each standby task on
another host, and then trigger the rebalance depending on whether this task has
a hot standby within the acceptable.recovery.lag.
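
To make the comparison concrete, here is a rough Java sketch of the current
condition next to one possible refined condition that combines ideas 1 to 3.
This is not the actual Streams code: the class, the CorruptedTask snapshot, its
fields, and the method names are all hypothetical, and treating idea 2 as
"standbys configured AND a standby copy exists on another live client" is just
one reading of the "and/or" above.

```java
import java.util.List;

// Hypothetical sketch only; none of these types exist in Kafka Streams.
class CorruptionRebalanceSketch {

    // CREATED and RESTORING are the states named in the ticket; RUNNING stands in for the rest.
    enum State { CREATED, RESTORING, RUNNING }

    // Assumed snapshot of what we would know about a corrupted task before wiping it.
    static final class CorruptedTask {
        final boolean active;
        final State state;
        final long lagFromChangelogEnd;           // this host's lag before the wipe
        final boolean standbyOnOtherLiveClient;   // idea 2: parsed from standby host info
        final long remoteStandbyLag;              // idea 3: -1 if the assignor did not report it

        CorruptedTask(boolean active, State state, long lagFromChangelogEnd,
                      boolean standbyOnOtherLiveClient, long remoteStandbyLag) {
            this.active = active;
            this.state = state;
            this.lagFromChangelogEnd = lagFromChangelogEnd;
            this.standbyOnOtherLiveClient = standbyOnOtherLiveClient;
            this.remoteStandbyLag = remoteStandbyLag;
        }
    }

    // Current condition: (a) eos is enabled and (b) at least one corrupted task was active.
    static boolean currentCondition(boolean eosEnabled, List<CorruptedTask> corrupted) {
        return eosEnabled && corrupted.stream().anyMatch(t -> t.active);
    }

    // Refined condition: as above, but the active task must also pass the checks below.
    static boolean refinedCondition(boolean eosEnabled, int numStandbyReplicas,
                                    long acceptableRecoveryLag, List<CorruptedTask> corrupted) {
        return eosEnabled && corrupted.stream().anyMatch(
            t -> t.active && worthRebalancing(t, numStandbyReplicas, acceptableRecoveryLag));
    }

    static boolean worthRebalancing(CorruptedTask t, int numStandbyReplicas,
                                    long acceptableRecoveryLag) {
        // Idea 1: a task that was still far behind here probably has no better home elsewhere.
        boolean notStarted = t.state == State.CREATED || t.state == State.RESTORING;
        if (notStarted && t.lagFromChangelogEnd > acceptableRecoveryLag) {
            return false;
        }
        // Idea 2: require configured standbys and a known copy on another live client.
        if (numStandbyReplicas == 0 || !t.standbyOnOtherLiveClient) {
            return false;
        }
        // Idea 3: if the standby's lag was reported, require it to be a hot standby.
        if (t.remoteStandbyLag >= 0) {
            return t.remoteStandbyLag <= acceptableRecoveryLag;
        }
        return true;
    }
}
```

With an acceptable.recovery.lag of 10000, for example, an active task that was
RESTORING with a lag of 50000 satisfies the current condition but not the
refined one, which is exactly the kind of unnecessary rebalance idea 1 is meant
to skip.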