dawidwys commented on a change in pull request #15589:
URL: https://github.com/apache/flink/pull/15589#discussion_r613878226
##########
File path:
flink-tests/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterStopWithSavepointITCase.java
##########
@@ -242,6 +258,27 @@ public void testRestartCheckpointCoordinatorIfStopWithSavepointFails() throws Ex
assertTrue(checkpointsToWaitFor.await(60L, TimeUnit.SECONDS));
}
+ private void stopWithSavepointAndWait(boolean terminate)
+ throws InterruptedException, ExecutionException {
+ final int attempts = 15;
+ for (int i = 0; i < attempts; i++) {
+ try {
+ stopWithSavepoint(terminate).get();
+ return;
+ } catch (Exception ex) {
+ Optional<CheckpointException> checkpointExceptionOptional =
+ ExceptionUtils.findThrowable(ex, CheckpointException.class);
+ if (!checkpointExceptionOptional.isPresent()
+ || checkpointExceptionOptional.get().getCheckpointFailureReason()
+ != NOT_ALL_REQUIRED_TASKS_RUNNING) {
+ throw ex;
+ }
+ // back off before triggering savepoint again
+ Thread.sleep(100);
+ }
+ }
+ }
+
Review comment:
It would work if we knew that everything is running, but I don't know how
to achieve that. I could not find a way to get that information reliably.
The number of attempts is arbitrary. There is no real logic behind picking 15,
and I can decrease it.
There is one additional subtlety: in some of the tests we actually expect
`stopWithSavepoint` to fail. However, we never expect it to fail with
`NOT_ALL_REQUIRED_TASKS_RUNNING`, and we only ever retry if that is the cause.
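
To make the retry rule concrete, here is a minimal, self-contained sketch of the same idea: retry only when the failure reason is `NOT_ALL_REQUIRED_TASKS_RUNNING`, and rethrow anything else immediately. It does not use Flink classes. `FailureReason`, `SavepointFailure`, `retry`, and the two demo helpers are hypothetical stand-ins introduced for illustration. Throwing once the attempts are exhausted is an assumption of this sketch, not something taken from the diff above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryOnTransientFailure {

    /** Hypothetical stand-in for a checkpoint failure reason. */
    enum FailureReason { NOT_ALL_REQUIRED_TASKS_RUNNING, OTHER }

    /** Hypothetical stand-in for a checkpoint exception carrying a reason. */
    static final class SavepointFailure extends Exception {
        final FailureReason reason;

        SavepointFailure(FailureReason reason) {
            super(reason.name());
            this.reason = reason;
        }
    }

    /** Runs the action, retrying only on NOT_ALL_REQUIRED_TASKS_RUNNING. */
    static <T> T retry(Callable<T> action, int attempts, long backoffMillis)
            throws Exception {
        SavepointFailure last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return action.call();
            } catch (SavepointFailure ex) {
                if (ex.reason != FailureReason.NOT_ALL_REQUIRED_TASKS_RUNNING) {
                    throw ex; // a failure the caller may legitimately expect
                }
                last = ex;
                Thread.sleep(backoffMillis); // back off before trying again
            }
        }
        // Assumption of this sketch: surface exhaustion instead of returning.
        throw new IllegalStateException("gave up after " + attempts + " attempts", last);
    }

    /** Demo: fails transiently `failures` times, then succeeds; returns the call count. */
    static int callsUntilSuccess(int failures) {
        AtomicInteger calls = new AtomicInteger();
        try {
            retry(() -> {
                if (calls.incrementAndGet() <= failures) {
                    throw new SavepointFailure(FailureReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
                }
                return "done";
            }, 15, 1L);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return calls.get();
    }

    /** Demo: a non-retryable failure is rethrown at once; returns the call count. */
    static int callsBeforeRethrow() {
        AtomicInteger calls = new AtomicInteger();
        try {
            retry(() -> {
                calls.incrementAndGet();
                throw new SavepointFailure(FailureReason.OTHER);
            }, 15, 1L);
        } catch (Exception expected) {
            // the failure propagated without any retry
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(callsUntilSuccess(2));  // prints 3
        System.out.println(callsBeforeRethrow());  // prints 1
    }
}
```

With this shape, a test that expects `stopWithSavepoint` to fail for another reason still sees that failure on the first attempt, because only the "tasks not running yet" case is retried.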