Greg Harris created KAFKA-15905:
-----------------------------------
Summary: Restarts of MirrorCheckpointTask should not permanently
interrupt offset translation
Key: KAFKA-15905
URL: https://issues.apache.org/jira/browse/KAFKA-15905
Project: Kafka
Issue Type: Improvement
Components: mirrormaker
Affects Versions: 3.6.0
Reporter: Greg Harris
Executive summary: When the MirrorCheckpointTask restarts, it loses the state
of checkpointsPerConsumerGroup, which limits offset translation to records
mirrored after the latest restart.
For example, if 1000 records are mirrored and the OffsetSyncs are read by
MirrorCheckpointTask, the emitted checkpoints are cached, and translation can
happen at the ~500th record. If MirrorCheckpointTask restarts, and 1000 more
records are mirrored, translation can happen at the ~1500th record, but no
longer at the ~500th record.
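The scenario above can be sketched with a toy model. This is not MirrorMaker 2 code: the class RestartTranslationSketch and its methods are hypothetical, the offsets are assumed to map one-to-one, and "restart" is simplified to discarding every sync the previous task generation could translate from.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

/** Toy model of the restart scenario; hypothetical, not the MM2 implementation. */
final class RestartTranslationSketch {
    // Upstream offset -> downstream offset, for syncs usable by the current task generation.
    private final TreeMap<Long, Long> usableSyncs = new TreeMap<>();

    /** Record an offset sync read during this generation's lifetime. */
    void onOffsetSync(long upstreamOffset, long downstreamOffset) {
        usableSyncs.put(upstreamOffset, downstreamOffset);
    }

    /** Simplifying assumption: a restart forgets everything usable for translation. */
    void restart() {
        usableSyncs.clear();
    }

    /** Translate using the closest sync at or below the consumer group's upstream offset. */
    OptionalLong translate(long upstreamGroupOffset) {
        Map.Entry<Long, Long> sync = usableSyncs.floorEntry(upstreamGroupOffset);
        return sync == null ? OptionalLong.empty() : OptionalLong.of(sync.getValue());
    }
}
```

In this model, a group at upstream offset 600 translates via the sync at 500 before the restart. After restart() and a new sync at 1500, the same group gets no translation, while a group at 1600 translates via the sync at 1500.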
Context:
Before KAFKA-13659, MM2 made translation decisions based on an
incompletely initialized OffsetSyncStore, and the checkpoint could appear to go
backwards temporarily during restarts. To fix this, we forced the
OffsetSyncStore to initialize completely before translation could take place,
ensuring that the latest OffsetSync had been read, and thus providing the most
accurate translation.
Before KAFKA-14666, MM2 translated offsets based only on the latest OffsetSync.
Afterwards, an in-memory sparse cache of historical OffsetSyncs was kept, to
allow for translation of earlier offsets. This came with the caveat that the
cache's sparseness allowed translations to go backwards permanently. To prevent
this behavior, a cache of the latest Checkpoints was kept in the
MirrorCheckpointTask#checkpointsPerConsumerGroup variable, and offset
translation remained restricted to the fully-initialized OffsetSyncStore.
Effectively, the MirrorCheckpointTask ensures that it translates based on an
OffsetSync emitted during its lifetime, to ensure that no previous
MirrorCheckpointTask emitted a later sync. If we can read the checkpoints
emitted by previous generations of MirrorCheckpointTask, we can still ensure
that checkpoints are monotonic, while allowing translation further back in
history.
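One possible shape of that idea, as a minimal sketch only: the class MonotonicCheckpointSketch, its string keys, and the restore step are all hypothetical and do not reflect the actual MirrorCheckpointTask code. It assumes the checkpoints emitted by earlier generations can be read back as a map from a group/partition key to the last emitted downstream offset.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: keep checkpoints monotonic across task restarts. */
final class MonotonicCheckpointSketch {
    // Key (for example "group/topic-partition") -> last emitted downstream offset.
    private final Map<String, Long> lastEmitted = new HashMap<>();

    /** Seed the cache from checkpoints emitted by previous task generations (assumed readable). */
    void restore(Map<String, Long> previouslyEmitted) {
        previouslyEmitted.forEach((key, offset) -> lastEmitted.merge(key, offset, Math::max));
    }

    /** Record the checkpoint and return true only if it does not move backwards. */
    boolean maybeEmit(String key, long downstreamOffset) {
        Long previous = lastEmitted.get(key);
        if (previous != null && downstreamOffset < previous) {
            return false; // would rewind a checkpoint that an earlier generation already emitted
        }
        lastEmitted.put(key, downstreamOffset);
        return true;
    }
}
```

With the cache seeded this way, a translation derived from an older OffsetSync can be compared against what was already emitted rather than being ruled out wholesale, so earlier offsets could be translated without letting a checkpoint go backwards.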
--
This message was sent by Atlassian Jira
(v8.20.10#820010)