rkhachatryan commented on a change in pull request #12611:
URL: https://github.com/apache/flink/pull/12611#discussion_r439007115
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -538,36 +538,45 @@ private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
                     coordinatorsToCheckpoint, pendingCheckpoint, timer),
                 timer);
-            CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
-                .whenCompleteAsync(
-                    (ignored, throwable) -> {
-                        final PendingCheckpoint checkpoint =
-                            FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
-
-                        if (throwable == null && checkpoint != null && !checkpoint.isDiscarded()) {
-                            // no exception, no discarding, everything is OK
-                            final long checkpointId = checkpoint.getCheckpointId();
-                            snapshotTaskState(
-                                timestamp,
-                                checkpointId,
-                                checkpoint.getCheckpointStorageLocation(),
-                                request.props,
-                                executions,
-                                request.advanceToEndOfTime);
-
-                            coordinatorsToCheckpoint.forEach((ctx) -> ctx.afterSourceBarrierInjection(checkpointId));
-
-                            onTriggerSuccess();
-                        } else {
-                            // the initialization might not be finished yet
-                            if (checkpoint == null) {
-                                onTriggerFailure(request, throwable);
+            FutureUtils.assertNoException(
Review comment:
Thanks for the update.
I'm also not sure about `assertNoException`, which calls `System.exit`
internally. That can cause problems because:
- other jobs may be running in the same process, and they would be killed too
- depending on the setup, the cause may be hard to find afterwards; e.g.
buffered logs can be lost
- any cleanup is skipped
Instead, we could notify `CheckpointFailureManager` so that it
terminates only this job.
What do you think?
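
To make the suggestion concrete, here is a minimal sketch of the idea: route an exceptional completion to a per-job handler rather than exiting the JVM. `TriggerFailureSketch`, `JobFailureHandler`, `failJob` and `failJobOnException` are hypothetical names used only for illustration; they are not Flink's actual `CheckpointFailureManager` API, and the real wiring in `CheckpointCoordinator` would likely look different.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class TriggerFailureSketch {

    /** Hypothetical stand-in for a per-job failure handler (NOT Flink's real API). */
    public static final class JobFailureHandler {
        public final List<Throwable> failures = new CopyOnWriteArrayList<>();

        /** Records the failure; a real handler would fail only the affected job. */
        public void failJob(Throwable cause) {
            failures.add(cause);
        }
    }

    /**
     * Forwards an exceptional completion of the future to the handler.
     * Unlike a fatal-exit handler, the process keeps running, so other jobs,
     * buffered logs and cleanup code are unaffected.
     */
    public static <T> CompletableFuture<T> failJobOnException(
            CompletableFuture<T> future, JobFailureHandler handler) {
        return future.whenComplete((value, throwable) -> {
            if (throwable != null) {
                handler.failJob(throwable);
            }
        });
    }

    public static void main(String[] args) {
        JobFailureHandler handler = new JobFailureHandler();
        CompletableFuture<Void> trigger = new CompletableFuture<>();
        failJobOnException(trigger, handler);

        // Simulate an unexpected error while triggering a checkpoint.
        trigger.completeExceptionally(new IllegalStateException("trigger failed"));
        System.out.println("recorded failures: " + handler.failures.size());
    }
}
```

The point of the design is only that the error stays scoped to one job; whether the handler should be `CheckpointFailureManager` itself is the open question above.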