tillrohrmann commented on a change in pull request #12670:
URL: https://github.com/apache/flink/pull/12670#discussion_r441345987
##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -538,51 +542,61 @@ private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
 						coordinatorsToCheckpoint, pendingCheckpoint, timer),
 					timer);
-			FutureUtils.assertNoException(
-				CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
-					.handleAsync(
-						(ignored, throwable) -> {
-							final PendingCheckpoint checkpoint =
-								FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
-
-							Preconditions.checkState(
-								checkpoint != null || throwable != null,
-								"Either the pending checkpoint needs to be created or an error must have been occurred.");
-
-							if (throwable != null) {
-								// the initialization might not be finished yet
-								if (checkpoint == null) {
-									onTriggerFailure(request, throwable);
-								} else {
-									onTriggerFailure(checkpoint, throwable);
-								}
+			FutureUtils.waitForAll(asList(masterStatesComplete, coordinatorCheckpointsComplete))
+				.handleAsync(
+					(ignored, throwable) -> {
+						final PendingCheckpoint checkpoint =
+							FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
+
+						Preconditions.checkState(
+							checkpoint != null || throwable != null,
+							"Either the pending checkpoint needs to be created or an error must have been occurred.");
+
+						if (throwable != null) {
+							// the initialization might not be finished yet
+							if (checkpoint == null) {
+								onTriggerFailure(request, throwable);
 							} else {
-								if (checkpoint.isDiscarded()) {
-									onTriggerFailure(
-										checkpoint,
-										new CheckpointException(
-											CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE,
-											checkpoint.getFailureCause()));
-								} else {
-									// no exception, no discarding, everything is OK
-									final long checkpointId = checkpoint.getCheckpointId();
-									snapshotTaskState(
-										timestamp,
-										checkpointId,
-										checkpoint.getCheckpointStorageLocation(),
-										request.props,
-										executions,
-										request.advanceToEndOfTime);
-
-									coordinatorsToCheckpoint.forEach((ctx) -> ctx.afterSourceBarrierInjection(checkpointId));
-
-									onTriggerSuccess();
-								}
+								onTriggerFailure(checkpoint, throwable);
 							}
+						} else {
+							if (checkpoint.isDiscarded()) {
+								onTriggerFailure(
+									checkpoint,
+									new CheckpointException(
+										CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE,
+										checkpoint.getFailureCause()));
+							} else {
+								// no exception, no discarding, everything is OK
+								final long checkpointId = checkpoint.getCheckpointId();
+								snapshotTaskState(
+									timestamp,
+									checkpointId,
+									checkpoint.getCheckpointStorageLocation(),
+									request.props,
+									executions,
+									request.advanceToEndOfTime);
+
+								coordinatorsToCheckpoint.forEach((ctx) -> ctx.afterSourceBarrierInjection(checkpointId));
+
+								onTriggerSuccess();
+							}
+						}
-							return null;
-						},
-						timer));
+						return null;
+					},
+					timer)
+				.whenComplete((unused, error) -> {
+					if (error != null) {
+						if (!isShutdown()) {
+							failureManager.handleJobLevelCheckpointException(new CheckpointException(EXCEPTION, error), Optional.empty());
Review comment:
> Also it's confusing, as it's not the normal/usual way how Flink fails (with System.exit I guess there would be no message in the error log, but only on stderr?).

I fear that your statement is not true @pnowojski. We actually try to set `FatalExitExceptionHandler` as the uncaught exception handler for every thread we start, wherever possible (it is not perfect yet). Hence, if an exception bubbles up in the runtime code (e.g. from a failed `checkState`), it will shut down the JVM. I believe that this is the right thing to do, as these situations constitute a programming error, and it is impossible to recover from all kinds of programming errors. Moreover, in the case of programming errors or other fatal errors it is in general not possible to clean up properly. Just imagine an OOM occurring while the clean-up code runs, for example.
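
To make the mechanism concrete, here is a minimal, standalone sketch of a fail-fast uncaught exception handler. The class name `ExitOnUncaughtException` and the exit code are made up for illustration; this is not Flink's actual `FatalExitExceptionHandler`.

```java
import java.util.concurrent.ThreadFactory;

// Sketch of a fail-fast uncaught exception handler: any exception that reaches the top
// of a thread terminates the whole JVM instead of being silently swallowed.
public final class ExitOnUncaughtException implements Thread.UncaughtExceptionHandler {

	@Override
	public void uncaughtException(Thread t, Throwable e) {
		// A failed checkState (or any other programming error) that bubbles up lands here.
		System.err.println("FATAL: uncaught exception in thread " + t.getName() + ": " + e);
		System.exit(1);
	}

	public static void main(String[] args) throws InterruptedException {
		ThreadFactory factory = runnable -> {
			Thread thread = new Thread(runnable);
			thread.setUncaughtExceptionHandler(new ExitOnUncaughtException());
			return thread;
		};

		// A programming error in this thread shuts down the process.
		factory.newThread(() -> {
			throw new IllegalStateException("programming error");
		}).start();

		Thread.sleep(1_000);
	}
}
```
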
Moreover, we are talking about code where the assumption is that all recoverable exceptions have been properly handled before it is called. Hence, if this assumption is violated, the only sensible reaction is to fail hard.

> I.e. if programming error corrupts state then job restart can be undesirable. Just crashing doesn't prevent restart, but failing the job does.

This is true, but at the moment we don't really support directly failing a job without restarts. Hence, the next best thing is to fail the process so that people see the problem.

> Because System.exit can leave log buffers containing error details unflushed.

Yes, this can potentially happen. So far, however, it has never been a problem.
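
For what it's worth, the unflushed-buffer concern is easy to reproduce outside of Flink. A minimal sketch with a plain `BufferedWriter` (class and file names made up for illustration):

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// Sketch only: shows that buffered log output can be lost when System.exit(...)
// terminates the JVM, and that a shutdown hook which flushes the buffer mitigates it.
public final class UnflushedLogDemo {

	public static void main(String[] args) throws IOException {
		BufferedWriter log = new BufferedWriter(new FileWriter("app.log"));

		// Comment this hook out and the line written below will usually never reach
		// disk, because System.exit() neither flushes nor closes the writer.
		Runtime.getRuntime().addShutdownHook(new Thread(() -> {
			try {
				log.flush();
				log.close();
			} catch (IOException ignored) {
				// nothing sensible left to do while shutting down
			}
		}));

		log.write("FATAL: programming error detected, terminating the JVM");
		log.newLine();

		// Simulates the hard failure path discussed above.
		System.exit(1);
	}
}
```

Since `System.exit()` still runs registered shutdown hooks (unlike `Runtime.halt()`), a logging framework that typically installs such a hook gets a chance to flush its buffers before the process dies.
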
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
