gyfora commented on PR #821: URL: https://github.com/apache/flink-kubernetes-operator/pull/821#issuecomment-2212378196
A small addition to the previous comments.

The SnapshotObserver currently also observes and updates the lastSavepointInfo for terminal jobs. This is a key mechanism for handling failures during stateful upgrades. So we have to update that logic as well, so that instead of recording the lastSavepoint status we use new logic shared with the snapshot custom resource mechanism.

This mechanism is currently covered by some tests; enabling the snapshot CR for those tests would probably help reproduce the problem. For example:

1. Introduce a failure after executing a stop-with-savepoint operation (before the snapshot CR is created)
2. This is the point where the observer would actually observe the savepoint/checkpoint info from the terminal job and update the status
3. Assert that the upgrade is actually executed from the correct savepoint

I believe with the current implementation the savepoint info would be lost.
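To make the three steps concrete, here is a minimal, self-contained sketch of the scenario as a toy model. It does not use the operator's real classes: `SavepointRecoverySketch`, `TerminalJob`, `ResourceStatus`, and all method names are hypothetical stand-ins, and the savepoint path is made up. It only illustrates the invariant the test should assert, assuming the observer is the only component that can recover the savepoint location after the failure.

```java
import java.util.Optional;

/** Toy model of the failure scenario; every type and method here is hypothetical. */
final class SavepointRecoverySketch {

    /** Stand-in for a job that reached a terminal state after stop-with-savepoint. */
    static final class TerminalJob {
        final String jobId;
        final String finalSavepointPath;

        TerminalJob(String jobId, String finalSavepointPath) {
            this.jobId = jobId;
            this.finalSavepointPath = finalSavepointPath;
        }
    }

    /** Stand-in for the state the operator persists between reconciliations. */
    static final class ResourceStatus {
        private String upgradeSavepointPath; // null until recorded or observed

        Optional<String> upgradeSavepointPath() {
            return Optional.ofNullable(upgradeSavepointPath);
        }

        void setUpgradeSavepointPath(String path) {
            this.upgradeSavepointPath = path;
        }
    }

    /**
     * Step 1: the savepoint completes on the cluster, but the operator fails
     * before creating the snapshot CR or writing anything to the status.
     */
    static TerminalJob stopWithSavepointThenFail(String jobId, String savepointPath) {
        return new TerminalJob(jobId, savepointPath);
    }

    /** Step 2: the observer reads the savepoint info from the terminal job. */
    static void observeTerminalJob(TerminalJob job, ResourceStatus status) {
        if (status.upgradeSavepointPath().isEmpty() && job.finalSavepointPath != null) {
            status.setUpgradeSavepointPath(job.finalSavepointPath);
        }
    }

    /** Step 3: the upgrade must restore from the observed savepoint, or fail loudly. */
    static String resolveRestorePath(ResourceStatus status) {
        return status.upgradeSavepointPath()
                .orElseThrow(() -> new IllegalStateException("savepoint info lost"));
    }

    public static void main(String[] args) {
        ResourceStatus status = new ResourceStatus();
        TerminalJob job = stopWithSavepointThenFail("job-1", "s3://bucket/savepoints/sp-1");
        observeTerminalJob(job, status);
        System.out.println("Upgrade restores from: " + resolveRestorePath(status));
    }
}
```

If the observer step is skipped, `resolveRestorePath` throws, which is the toy-model equivalent of the savepoint info being lost. A real test would presumably inject the failure through the existing test harness rather than a helper like this.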
