[GitHub] [flink] tillrohrmann commented on a change in pull request #12670: [FLINK-18290][checkpointing] Fail job on checkpoint future failure instead of System.exit

GitBox Tue, 16 Jun 2020 07:43:16 -0700


tillrohrmann commented on a change in pull request #12670:
URL: https://github.com/apache/flink/pull/12670#discussion_r440906141




##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -538,51 +542,61 @@ private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
                                     coordinatorsToCheckpoint, pendingCheckpoint, timer),
                             timer);
 
-            FutureUtils.assertNoException(
-                CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
-                    .handleAsync(
-                        (ignored, throwable) -> {
-                            final PendingCheckpoint checkpoint =
-                                FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
-
-                            Preconditions.checkState(
-                                checkpoint != null || throwable != null,
-                                "Either the pending checkpoint needs to be created or an error must have been occurred.");
-
-                            if (throwable != null) {
-                                // the initialization might not be finished yet
-                                if (checkpoint == null) {
-                                    onTriggerFailure(request, throwable);
-                                } else {
-                                    onTriggerFailure(checkpoint, throwable);
-                                }
+            FutureUtils.waitForAll(asList(masterStatesComplete, coordinatorCheckpointsComplete))
+                .handleAsync(
+                    (ignored, throwable) -> {
+                        final PendingCheckpoint checkpoint =
+                            FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
+
+                        Preconditions.checkState(
+                            checkpoint != null || throwable != null,
+                            "Either the pending checkpoint needs to be created or an error must have been occurred.");
+
+                        if (throwable != null) {
+                            // the initialization might not be finished yet
+                            if (checkpoint == null) {
+                                onTriggerFailure(request, throwable);
                             } else {
-                                if (checkpoint.isDiscarded()) {
-                                    onTriggerFailure(
-                                        checkpoint,
-                                        new CheckpointException(
-                                            CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE,
-                                            checkpoint.getFailureCause()));
-                                } else {
-                                    // no exception, no discarding, everything is OK
-                                    final long checkpointId = checkpoint.getCheckpointId();
-                                    snapshotTaskState(
-                                        timestamp,
-                                        checkpointId,
-                                        checkpoint.getCheckpointStorageLocation(),
-                                        request.props,
-                                        executions,
-                                        request.advanceToEndOfTime);
-
-                                    coordinatorsToCheckpoint.forEach((ctx) -> ctx.afterSourceBarrierInjection(checkpointId));
-
-                                    onTriggerSuccess();
-                                }
+                                onTriggerFailure(checkpoint, throwable);
                             }
+                        } else {
+                            if (checkpoint.isDiscarded()) {
+                                onTriggerFailure(
+                                    checkpoint,
+                                    new CheckpointException(
+                                        CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE,
+                                        checkpoint.getFailureCause()));
+                            } else {
+                                // no exception, no discarding, everything is OK
+                                final long checkpointId = checkpoint.getCheckpointId();
+                                snapshotTaskState(
+                                    timestamp,
+                                    checkpointId,
+                                    checkpoint.getCheckpointStorageLocation(),
+                                    request.props,
+                                    executions,
+                                    request.advanceToEndOfTime);
+
+                                coordinatorsToCheckpoint.forEach((ctx) -> ctx.afterSourceBarrierInjection(checkpointId));
+
+                                onTriggerSuccess();
+                            }
+                        }
 
-                            return null;
-                        },
-                        timer));
+                        return null;
+                    },
+                    timer)
+                .whenComplete((unused, error) -> {
+                    if (error != null) {
+                        if (!isShutdown()) {
+                            failureManager.handleJobLevelCheckpointException(new CheckpointException(EXCEPTION, error), Optional.empty());

Review comment:
If we are only talking about programming errors, then I believe we should call `System.exit` (i.e., fail hard), because programming errors usually leave the system in a corrupted state.

Why do you think that it hides error details?

Not doing resource cleanup in a failure case is acceptable.

Failing other jobs when the process has been corrupted is fine as well.
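
To make the two options under discussion concrete, here is a minimal, self-contained sketch. It is not Flink code: the class `CheckpointFailurePolicySketch`, the method `onUnexpectedFailure`, and both handlers are hypothetical names chosen for illustration, and the "fail hard" handler only records a message where a real one would presumably terminate the process (for example via `System.exit`). It uses only `java.util.concurrent.CompletableFuture` from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class CheckpointFailurePolicySketch {

    // Hypothetical helper: runs the handler only if the future completed exceptionally.
    static void onUnexpectedFailure(CompletableFuture<?> future, Consumer<Throwable> handler) {
        future.whenComplete((unused, error) -> {
            if (error != null) {
                handler.accept(error);
            }
        });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();

        // Option A, fail hard: treat the error as a programming bug.
        // Assumption: a real handler would stop the process here instead of logging.
        Consumer<Throwable> failHard = t -> log.add("fatal: " + t.getMessage());

        // Option B, fail the job: report the error and keep the process alive.
        Consumer<Throwable> failJob = t -> log.add("job failed: " + t.getMessage());

        // Simulate an unexpected exception escaping the trigger path.
        CompletableFuture<Void> broken = new CompletableFuture<>();
        broken.completeExceptionally(new IllegalStateException("bug in trigger path"));

        onUnexpectedFailure(broken, failHard);
        onUnexpectedFailure(broken, failJob);

        log.forEach(System.out::println);
    }
}
```

The trade-off in the sketch matches the one argued above: Option A stops everything sharing the process, which is acceptable if the process state can no longer be trusted, while Option B confines the failure to one job but keeps running on possibly corrupted state.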




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
