dxichen opened a new pull request #1571: URL: https://github.com/apache/samza/pull/1571
Symptom: Checkpoint v2 preference could cause stale checkpoints to get picked up when newer v1 checkpoints are available Cause: Kafka checkpoint manager selects the checkpoints based on the `task.checkpoint.read.versions` priority list. This does not account for the fact that the prioritized checkpoint may be stale and a new checkpoint lower on the priority list is present in the checkpoint topic. This will cause the job to silently use the stale checkpoint and cause more reprocessing than the most recent commit. An example case where this could happen is when the users upgrades to a v2 checkpoint commits a v2 checkpoint and rollbacks to a previous version using only v1 checkpoint version then upgrading to the newest version again, using that stale v2 checkpoint from the initial upgrade. Changes: Added a `task.live.checkpoint.max.age` default 10 mins (must be > task.commit.ms interval; internal config) which would indicate if a checkpoint has gone stale if: a) There exists a checkpoint of another type more recent than it b) The diff between new checkpoint's kafka log append time and the prioritized version's log append time > `task.live.checkpoint.max.age' Tests: Integration tests writing to the checkpoint topic with stale checkpoint v2 and consuming it with a job that prioritize v2 over v1 checkpoints with `task.live.checkpoint.max.age' = 0 asserting it prioritizes v1 checkpoints instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
