benWize commented on pull request #15664: URL: https://github.com/apache/beam/pull/15664#issuecomment-938138029
Hi @ibzib, could you help me review this?

I think the cause is that `triggerSavepoint` in https://github.com/apache/beam/blob/master/runners/flink/src/test/java/org/apache/beam/runners/flink/FlinkSavepointTest.java#L259 is called twice, as shown in the logs here: https://scans.gradle.com/s/tdagq66c7f4n2/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy/1/output (the log message `Triggering cancel-with-savepoint` appears twice).

The first loop tries to trigger the savepoint and cancel the job. If the first call in the loop throws an exception, a second loop tries to trigger the savepoint while some tasks are being canceled, which fails with the message `Failed to trigger checkpoint for job xxx since Checkpoint triggering task Source: Impulse (1/1) of job xxx is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running.`

I was able to reproduce the error shown in the logs locally by executing `triggerSavepoint` twice.

My proposed fix is to split trigger and cancel, so that the job is not canceled if the savepoint operation fails on its first call.
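
To illustrate the idea of splitting trigger and cancel, here is a rough sketch. It is not the code in this PR: `JobControl`, `FlakyJob`, and `savepointThenCancel` are hypothetical names made up for the example and do not correspond to Flink or Beam APIs. The point is only that the savepoint is retried while the job is still running, and cancel is issued once, after a savepoint has succeeded.

```java
class SavepointThenCancelSketch {

  /** Hypothetical stand-in for a job client; NOT the Flink or Beam API. */
  interface JobControl {
    /** Takes a savepoint and leaves the job running; returns the savepoint path. */
    String triggerSavepoint() throws Exception;

    /** Cancels the job. */
    void cancel() throws Exception;
  }

  /**
   * Retries the savepoint up to maxAttempts times, then cancels exactly once.
   * Because cancel is never issued before a savepoint succeeds, a retry never
   * runs against tasks that are already being canceled.
   */
  static String savepointThenCancel(JobControl job, int maxAttempts) throws Exception {
    String path = null;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts && path == null; attempt++) {
      try {
        path = job.triggerSavepoint();
      } catch (Exception e) {
        last = e; // job is still running, so trying again is safe
      }
    }
    if (path == null) {
      throw new IllegalStateException(
          "savepoint failed after " + maxAttempts + " attempts", last);
    }
    job.cancel();
    return path;
  }

  /** Hypothetical test double: the first failuresBeforeSuccess savepoint calls throw. */
  static class FlakyJob implements JobControl {
    final int failuresBeforeSuccess;
    int triggers = 0;
    int cancels = 0;

    FlakyJob(int failuresBeforeSuccess) {
      this.failuresBeforeSuccess = failuresBeforeSuccess;
    }

    @Override
    public String triggerSavepoint() throws Exception {
      triggers++;
      if (triggers <= failuresBeforeSuccess) {
        throw new Exception("simulated savepoint failure");
      }
      return "savepoint-" + triggers;
    }

    @Override
    public void cancel() {
      cancels++;
    }
  }

  public static void main(String[] args) throws Exception {
    FlakyJob job = new FlakyJob(1);
    String path = savepointThenCancel(job, 3);
    System.out.println(path + " triggers=" + job.triggers + " cancels=" + job.cancels);
  }
}
```

With the combined cancel-with-savepoint call, a failed first attempt can leave tasks in a canceling state before the retry; in the sketch above the retry only ever sees a running job.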
