gyfora commented on PR #821:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/821#issuecomment-2212378196

   A small addition to the previous comments. The SnapshotObserver currently 
also observes and updates the lastSavepointInfo for terminal jobs. This is a 
key mechanism for handling failures during stateful upgrades.
   
   So we have to update that logic as well, so that instead of recording the 
lastSavepoint status we use a new shared logic with the snapshot custom 
resource mechanism. This mechanism is currently covered by some tests; enabling 
the snapshot CR for those tests would probably help reproduce the problem. For 
example:
   
   1. Introduce a failure after executing a stop-with-savepoint operation 
(before the snapshot CR was created)
   2. This is the point where the observer would actually observe the 
savepoint/checkpoint info from the terminal job and update the status 
   3. Assert that the upgrade is actually executed from the correct savepoint 
   
   I believe that with the current implementation the savepoint info would be lost.
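
   The three steps above could be sketched roughly as follows. This is only a toy model of the scenario, not the operator's code: `Status`, `TerminalJob`, `stopWithSavepoint`, `observeTerminalJob` and `runScenario` are hypothetical names invented for illustration, and the savepoint path is made up. A real test would go through the operator's existing test utilities with the snapshot CR enabled.

```java
public class SavepointRecoverySketch {

    /** Hypothetical stand-in for the place where the upgrade savepoint is recorded. */
    static final class Status {
        String upgradeSavepointPath; // null means "nothing recorded yet"
    }

    /** Hypothetical stand-in for a job that finished via stop-with-savepoint. */
    static final class TerminalJob {
        final String lastSavepointPath;

        TerminalJob(String lastSavepointPath) {
            this.lastSavepointPath = lastSavepointPath;
        }
    }

    /**
     * Step 1: stop the job with a savepoint. When failBeforeRecording is true,
     * the simulated failure happens before the savepoint is recorded
     * (in the real system: before the snapshot CR was created).
     */
    static TerminalJob stopWithSavepoint(String path, boolean failBeforeRecording, Status status) {
        TerminalJob job = new TerminalJob(path);
        if (!failBeforeRecording) {
            status.upgradeSavepointPath = path;
        }
        return job;
    }

    /** Step 2: the observer recovers the savepoint info from the terminal job. */
    static void observeTerminalJob(TerminalJob job, Status status) {
        if (status.upgradeSavepointPath == null) {
            status.upgradeSavepointPath = job.lastSavepointPath;
        }
    }

    /** Runs the scenario and returns the savepoint the upgrade would start from. */
    static String runScenario(boolean observerHandlesTerminalJobs) {
        Status status = new Status();
        TerminalJob job = stopWithSavepoint("file:///tmp/savepoint-1", true, status);
        if (observerHandlesTerminalJobs) {
            observeTerminalJob(job, status);
        }
        return status.upgradeSavepointPath;
    }

    public static void main(String[] args) {
        // Step 3: check which savepoint the upgrade would use in each case.
        System.out.println("observer handles terminal jobs: " + runScenario(true));
        System.out.println("observer skips terminal jobs:   " + runScenario(false));
    }
}
```

   In this model, the second case (nothing recorded after the failure) corresponds to the lost savepoint info described above; the first case is what the test should assert once the shared logic is in place.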

