Guozhang Wang created KAFKA-12693:
-------------------------------------

             Summary: Consecutive rebalances with zombie instances may cause 
corrupted changelogs
                 Key: KAFKA-12693
                 URL: https://issues.apache.org/jira/browse/KAFKA-12693
             Project: Kafka
          Issue Type: Bug
            Reporter: Guozhang Wang


When an instance (or thread within an instance) of Kafka Streams has a soft 
failure and the group coordinator triggers a rebalance, that instance would 
temporarily become a "zombie writer". That is, this instance does not know 
there's already a new rebalance and hence its partitions have been migrated 
out, until it tries to commit and then got notified of the illegal-generation 
error and realize itself is the "zombie" already. During this period until the 
commit, this zombie may still be writing data to the changelogs of the migrated 
tasks as the new owner has already taken over and also writing to the 
changelogs.

When EOS is enabled, this would not be a problem: when the zombie tries to 
commit and got notified that it's fenced, its zombie appends would be aborted. 
With EOS disabled, though, such shared writes would be interleaved on the 
changelogs where a zombie append may arrive later after the new writer's 
append, effectively overwriting that new append.

Note that such interleaving writes do not necessarily cause corrupted data: as 
long as the new producer keep appending after the old zombie stops, and all the 
corrupted keys are overwritten again by the new values, then it is fine. 
However, if there are consecutive rebalances where right after the changelogs 
are corrupted by zombie writers, and before the new writer can overwrite them 
again, the task gets migrated again and needs to be restored from changelogs, 
the old values would be restored instead of the new values, effectively causing 
data loss.

Although this should be a rare event, we should fix it asap still. One idea is 
to have producers get a PID even under ALOS: that is, we set the transactional 
id in the producer config, but did not trigger any txn APIs; when there are 
zombie producers, they would then be immediately fenced on appends and hence 
there's no interleaved appends. I think this may require a KIP still, since 
today one has to call initTxn in order to register and get the PID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to