Guozhang Wang created KAFKA-12693:
-------------------------------------
Summary: Consecutive rebalances with zombie instances may cause
corrupted changelogs
Key: KAFKA-12693
URL: https://issues.apache.org/jira/browse/KAFKA-12693
Project: Kafka
Issue Type: Bug
Reporter: Guozhang Wang
When an instance (or a thread within an instance) of Kafka Streams has a soft
failure and the group coordinator triggers a rebalance, that instance
temporarily becomes a "zombie writer". That is, the instance does not know
that a rebalance has already happened and that its partitions have been
migrated away until it tries to commit, is notified of the illegal-generation
error, and realizes that it is already the "zombie". During the period before
that commit, the zombie may still be writing data to the changelogs of the
migrated tasks while the new owner has already taken over and is also writing
to the same changelogs.
When EOS is enabled, this is not a problem: when the zombie tries to commit
and is notified that it has been fenced, its zombie appends are aborted.
With EOS disabled, though, such shared writes are interleaved on the
changelogs, and a zombie append may arrive after the new writer's append,
effectively overwriting that new append.
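The difference between the two modes can be sketched with a toy model. This is
only an illustration under simplifying assumptions, not Kafka's actual log
format: the Record type, the txn field and the read_committed helper are all
hypothetical names. With EOS, the zombie's records belong to an aborted
transaction and are skipped on read; without EOS nothing marks them, so the
late zombie append wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    txn: str  # hypothetical id of the transaction that wrote this record


def read_committed(log, aborted):
    """Replay the log in offset order, skipping records of aborted transactions."""
    store = {}
    for record in log:
        if record.txn in aborted:
            continue
        store[record.key] = record.value
    return store


# The new owner appends first, then a delayed zombie append lands after it.
log = [
    Record("k", "new", "owner-txn-1"),
    Record("k", "stale", "zombie-txn-7"),
]

# EOS (toy model): the zombie was fenced at commit, so its transaction aborted.
eos_view = read_committed(log, aborted={"zombie-txn-7"})

# ALOS (toy model): there is no transaction to abort, so every record counts.
alos_view = read_committed(log, aborted=set())

print(eos_view)   # {'k': 'new'}
print(alos_view)  # {'k': 'stale'}
```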
Note that such interleaved writes do not necessarily cause corrupted data: as
long as the new producer keeps appending after the old zombie stops, and all
the corrupted keys are overwritten again with the new values, the changelog
ends up correct.
However, if there are consecutive rebalances such that, right after the
changelogs are corrupted by zombie writers and before the new writer can
overwrite them again, the task is migrated again and needs to be restored from
the changelogs, then the old values are restored instead of the new values,
effectively causing data loss.
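The sequence above can be walked through with a minimal simulation. It assumes
a changelog is just an append-only list of (key, value) pairs and that
restoring a task replays it so the last record per key wins; the restore
function and the sample values are made up for illustration.

```python
def restore(changelog):
    """Rebuild a state store by replaying the changelog in offset order."""
    store = {}
    for key, value in changelog:
        store[key] = value
    return store


changelog = [("k", "v1")]  # contents before the first rebalance

# First rebalance: the new owner takes over and appends an updated value,
# then a delayed append from the zombie lands after it.
changelog.append(("k", "v2"))        # new owner
changelog.append(("k", "v1-stale"))  # zombie, arrives late

# Second rebalance happens before the new owner rewrites "k":
# the next owner restores the zombie's value and "v2" is lost.
lost = restore(changelog)

# Had the new owner appended again before the second rebalance,
# the changelog would have healed itself.
healed = restore(changelog + [("k", "v3")])

print(lost)    # {'k': 'v1-stale'}
print(healed)  # {'k': 'v3'}
```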
Although this should be a rare event, we should still fix it as soon as
possible. One idea is to have producers get a PID even under ALOS: that is, we
set the transactional id in the producer config but do not call any txn APIs;
zombie producers would then be fenced immediately on append, and hence there
would be no interleaved appends. I think this may still require a KIP, since
today one has to call initTxn (initTransactions()) in order to register and
get the PID.
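The intended effect of fencing on append can be sketched as a toy model. This
is not the broker's implementation and not a proposed API: ToyLog, register,
FencedError and the id "task-0_1" are hypothetical, and the sketch simply
assumes that registering a producer under an existing transactional id bumps
an epoch, after which appends carrying an older epoch are rejected.

```python
class FencedError(Exception):
    """Raised when a producer with a stale epoch tries to append."""


class ToyLog:
    def __init__(self):
        self.records = []
        self.latest_epoch = {}  # transactional id -> highest registered epoch

    def register(self, txn_id):
        """Register a producer for txn_id and return its (bumped) epoch."""
        epoch = self.latest_epoch.get(txn_id, -1) + 1
        self.latest_epoch[txn_id] = epoch
        return epoch

    def append(self, txn_id, epoch, key, value):
        """Append a record, rejecting producers whose epoch is stale."""
        if epoch < self.latest_epoch[txn_id]:
            raise FencedError(f"epoch {epoch} is fenced for {txn_id}")
        self.records.append((key, value))


log = ToyLog()
zombie_epoch = log.register("task-0_1")  # original owner, later a zombie
owner_epoch = log.register("task-0_1")   # new owner after the rebalance

log.append("task-0_1", owner_epoch, "k", "new")

try:
    log.append("task-0_1", zombie_epoch, "k", "stale")
    zombie_fenced = False
except FencedError:
    zombie_fenced = True

print(zombie_fenced)  # True
print(log.records)    # [('k', 'new')]
```

In this model the zombie's late append never reaches the log, so there is
nothing for a later restore to pick up.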