[
https://issues.apache.org/jira/browse/KAFKA-12693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guozhang Wang updated KAFKA-12693:
----------------------------------
Labels: new-streams-runtime-should-fix streams (was: streams)
> Consecutive rebalances with zombie instances may cause corrupted changelogs
> ---------------------------------------------------------------------------
>
> Key: KAFKA-12693
> URL: https://issues.apache.org/jira/browse/KAFKA-12693
> Project: Kafka
> Issue Type: Bug
> Reporter: Guozhang Wang
> Priority: Major
> Labels: new-streams-runtime-should-fix, streams
>
> When an instance (or a thread within an instance) of Kafka Streams has a soft
> failure and the group coordinator triggers a rebalance, that instance
> temporarily becomes a "zombie writer". That is, the instance does not know
> that a new rebalance has happened and that its partitions have been migrated
> out until it tries to commit, gets notified of the illegal-generation error,
> and realizes that it is already the "zombie". During this period until the
> commit, the zombie may still be writing data to the changelogs of the
> migrated tasks, even though the new owner has already taken over and is also
> writing to the same changelogs.
> When EOS is enabled, this is not a problem: when the zombie tries to
> commit and gets notified that it is fenced, its zombie appends are
> aborted. With EOS disabled, though, such shared writes are interleaved
> on the changelogs, where a zombie append may arrive after the new
> writer's append, effectively overwriting that new append.
> Note that such interleaved writes do not necessarily cause corrupted data:
> as long as the new producer keeps appending after the old zombie stops, and
> all the corrupted keys are overwritten again by the new values, it is
> fine. However, consecutive rebalances can make the corruption permanent: if,
> right after the changelogs are corrupted by zombie writers and before the new
> writer can overwrite them again, the task gets migrated again and needs to be
> restored from the changelogs, the old values are restored instead of the new
> values, effectively causing data loss.
> Although this should be a rare event, we should still fix it as soon as
> possible. One idea is to have producers get a PID even under ALOS: that is,
> we set the transactional id in the producer config but do not trigger any
> txn APIs; when there are zombie producers, they would then be immediately
> fenced on appends, and hence there would be no interleaved appends. I think
> this may still require a KIP, since today one has to call initTxn in order
> to register and get the PID.
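
The scenario in the description can be illustrated with a minimal sketch that uses a toy in-memory changelog rather than Kafka itself. All names here (`ZombieChangelogSketch`, `Changelog`, `registerNewOwner`, the epoch field) are hypothetical and invented for illustration. The epoch check only stands in for the fencing idea proposed above; it is an assumption about how such fencing might behave, not existing producer behavior under ALOS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZombieChangelogSketch {

    /** One changelog record plus the (hypothetical) epoch of the writer that produced it. */
    static final class Append {
        final String key;
        final String value;
        final int writerEpoch;

        Append(String key, String value, int writerEpoch) {
            this.key = key;
            this.value = value;
            this.writerEpoch = writerEpoch;
        }
    }

    /** Toy append-only changelog. With fencing on, appends from older epochs are rejected. */
    static final class Changelog {
        private final List<Append> log = new ArrayList<>();
        private final boolean fencingEnabled;
        private int latestEpoch = 0;

        Changelog(boolean fencingEnabled) {
            this.fencingEnabled = fencingEnabled;
        }

        /** Each new task owner registers and bumps the epoch (assumed mechanism). */
        int registerNewOwner() {
            return ++latestEpoch;
        }

        /** Returns false if the append was fenced, true if it was written. */
        boolean append(Append a) {
            if (fencingEnabled && a.writerEpoch < latestEpoch) {
                return false; // zombie writer is rejected at append time
            }
            log.add(a);
            return true;
        }

        /** Restoration replays the log in offset order; the last write per key wins. */
        Map<String, String> restore() {
            Map<String, String> state = new HashMap<>();
            for (Append a : log) {
                state.put(a.key, a.value);
            }
            return state;
        }
    }

    /** Plays out: old owner writes, rebalance, new owner writes, zombie writes late, restore. */
    static Map<String, String> run(boolean fencingEnabled) {
        Changelog changelog = new Changelog(fencingEnabled);

        int oldOwnerEpoch = changelog.registerNewOwner();
        changelog.append(new Append("k", "v1", oldOwnerEpoch));

        // First rebalance: the task migrates, but the old owner does not know yet.
        int newOwnerEpoch = changelog.registerNewOwner();
        changelog.append(new Append("k", "v2", newOwnerEpoch));

        // The zombie's append arrives after the new owner's append.
        changelog.append(new Append("k", "stale-from-zombie", oldOwnerEpoch));

        // Second rebalance happens before the new owner rewrites "k":
        // the next owner restores its state from the changelog as it stands now.
        return changelog.restore();
    }

    public static void main(String[] args) {
        System.out.println("without fencing: " + run(false)); // {k=stale-from-zombie}
        System.out.println("with fencing:    " + run(true));  // {k=v2}
    }
}
```

Without fencing, the restored state holds the zombie's stale value, which is the data loss described above. With the assumed epoch check, the zombie's late append is dropped and the new owner's value survives restoration.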
--
This message was sent by Atlassian Jira
(v8.20.1#820001)