Tim Patterson created KAFKA-13600:
-------------------------------------
Summary: Rebalances while streams is in a degraded state can cause stores
to be reassigned and restored from scratch
Key: KAFKA-13600
URL: https://issues.apache.org/jira/browse/KAFKA-13600
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.0.0, 2.8.1
Reporter: Tim Patterson
Consider this scenario:
# A node is lost from the cluster.
# A rebalance is kicked off with a new "target assignment" (i.e. the rebalance
is attempting to move a lot of tasks - see
https://issues.apache.org/jira/browse/KAFKA-10121).
# The Kafka cluster is now a bit more sluggish from the increased load.
# A rolling deploy happens, triggering rebalances; during these rebalances
processing continues but offsets can't be committed (or nodes are restarted but
fail to commit offsets).
# The most caught up nodes are now no longer within `acceptableRecoveryLag`, so
the task is started in its "target assignment" location, restoring all state
from scratch and delaying further processing instead of using the "almost
caught up" node.
We've hit this a few times, and since we have a lot of state (~25TB worth) and
are heavy users of IQ, this is not ideal for us.
While we can increase `acceptableRecoveryLag` to larger values to try to get
around this, that causes other issues (i.e. a warmup becoming active while it's
still quite far behind).
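For context, changing that threshold is just a config change. Here's a minimal
sketch of how it looks (the application id, bootstrap server and `topology`
below are placeholders, not from our setup):

{code:java}
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class RecoveryLagConfigExample {
    public static void main(String[] args) {
        Topology topology = new Topology(); // placeholder topology

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        // acceptable.recovery.lag (default 10000 offsets): a client within this
        // lag is considered "caught up" for task placement. Raising it keeps
        // tasks on almost-caught-up nodes, but also lets a warmup become active
        // while it's still quite far behind.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 1_000_000L);

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
    }
}
{code}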
The solution seems to be to balance "balanced assignment" with "most caught up
nodes".
We've got a fork where we do just this and it's made a huge difference to the
reliability of our cluster.
Our change is to simply use the most caught up node if the "target node" is
more than `acceptableRecoveryLag` behind.
This gives up some of the load-balancing behaviour of the existing code, but in
practice it doesn't seem to matter too much.
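To make the change concrete, here is a hedged sketch of the rule with
illustrative names (this is not the real assignor API; `taskLagByClient` is
assumed to hold each client's total lag for the task being placed):

{code:java}
import java.util.Map;
import java.util.UUID;

// Sketch of the fork's rule: keep the balance-preferred "target" client unless
// it is more than acceptableRecoveryLag behind, in which case fall back to the
// most caught up client for the active task.
final class MostCaughtUpFallback {

    static UUID chooseActiveClient(Map<UUID, Long> taskLagByClient,
                                   UUID targetClient,
                                   long acceptableRecoveryLag) {
        // The client with the smallest lag for this task, i.e. the most caught up.
        UUID mostCaughtUp = taskLagByClient.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(targetClient);

        long targetLag = taskLagByClient.getOrDefault(targetClient, Long.MAX_VALUE);

        // If the target would have to restore more than acceptableRecoveryLag
        // from scratch, use the almost-caught-up client instead and let the
        // target warm up in the background.
        return targetLag > acceptableRecoveryLag ? mostCaughtUp : targetClient;
    }
}
{code}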
An algorithm that identified candidate nodes as those within
`acceptableRecoveryLag` of the most caught up node might allow the best of both
worlds.
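A similarly hedged sketch of that candidate-set idea (again illustrative names,
not an existing API): any client within `acceptableRecoveryLag` of the most
caught up client is a valid placement, so the balancing pass can choose among
those candidates only.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.stream.Collectors;

final class CaughtUpCandidates {

    // Returns every client close enough to the most caught up client to be a
    // reasonable home for the task; normal balancing then picks among these.
    static Set<UUID> candidates(Map<UUID, Long> taskLagByClient,
                                long acceptableRecoveryLag) {
        long bestLag = taskLagByClient.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);

        return taskLagByClient.entrySet().stream()
                .filter(e -> e.getValue() - bestLag <= acceptableRecoveryLag)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
{code}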
Our fork is
[https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1]
(We also moved the capacity constraint code to run after all of the stateful
assignment, so that standby tasks are prioritised over warmup tasks.)
Ideally we don't want to maintain a fork of Kafka Streams going forward, so
we're hoping to get some discussion / agreement on the best way to handle this.
We're more than happy to contribute code, test different algorithms in our
production system, or do anything else to help with this issue.