1u0 commented on a change in pull request #9131: [FLINK-12858][checkpointing] 
Stop-with-savepoint, workaround: fail whole job when savepoint is declined by a 
task
URL: https://github.com/apache/flink/pull/9131#discussion_r303938502
 
 

 ##########
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/LegacyScheduler.java
 ##########
 @@ -649,4 +659,22 @@ private String retrieveTaskManagerLocation(ExecutionAttemptID executionAttemptID
                        .map(TaskManagerLocation::toString)
                        .orElse("Unknown location");
        }
+
+       private static boolean isCheckpointDeclinedException(Throwable throwable) {
+               return ExceptionUtils.findThrowable(throwable, CheckpointException.class)
+                       .map(CheckpointException::getCheckpointFailureReason)
+                       .map(reason -> {
+                               switch (reason) {
+                                       case CHECKPOINT_DECLINED:
+                                       case CHECKPOINT_DECLINED_TASK_NOT_READY:
+                                       case CHECKPOINT_DECLINED_SUBSUMED:
+                                       case CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED:
+                                       case CHECKPOINT_DECLINED_INPUT_END_OF_STREAM:
 
 Review comment:
   **NB:** this check is very rough. It may be too pessimistic, in that some 
of these reasons do not necessarily leave the job in a half-locked state 
(`CHECKPOINT_DECLINED`, `CHECKPOINT_DECLINED_TASK_NOT_READY`).
   
    * `CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED` is the case I can 
reproduce;
    * `CHECKPOINT_DECLINED_INPUT_END_OF_STREAM` is also a potential issue, but 
may not be easy to reproduce (it should happen during checkpoint alignment in a 
join when one branch has passed the checkpoint and the other has just ended);
    * `CHECKPOINT_DECLINED_SUBSUMED` should not happen, but is left in to be 
more future-proof;
    * `TASK_CHECKPOINT_FAILURE`: I'm not sure whether this one should also be 
present here.
   
   Also, an open question: what should we do if the exception is not a 
`CheckpointException`? We expect such causes to fail the task that originated 
the exception, but I'm not sure how that would interact with region recovery.
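
   To make the classification above concrete, here is a minimal, self-contained 
sketch of the same idea. It is not the code from this PR: `Reason`, 
`DeclinedException` and `findInChain` are hypothetical stand-ins for Flink's 
failure-reason enum, `CheckpointException` and `ExceptionUtils.findThrowable`, 
so the snippet compiles without Flink on the classpath. `TASK_CHECKPOINT_FAILURE` 
is deliberately left out of the "declined" set, since whether it belongs there 
is still open. A throwable whose cause chain contains no `DeclinedException` is 
classified as not declined, which mirrors the open question above.

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;

final class DeclinedCheckSketch {

    // Hypothetical stand-in for the failure reasons named in the review comment.
    enum Reason {
        CHECKPOINT_DECLINED,
        CHECKPOINT_DECLINED_TASK_NOT_READY,
        CHECKPOINT_DECLINED_SUBSUMED,
        CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED,
        CHECKPOINT_DECLINED_INPUT_END_OF_STREAM,
        TASK_CHECKPOINT_FAILURE
    }

    // Hypothetical stand-in for an exception that carries a failure reason.
    static final class DeclinedException extends Exception {
        private final Reason reason;

        DeclinedException(Reason reason) {
            super(reason.name());
            this.reason = reason;
        }

        Reason getReason() {
            return reason;
        }
    }

    // Reasons treated as "declined"; TASK_CHECKPOINT_FAILURE is excluded on purpose.
    private static final Set<Reason> DECLINED = EnumSet.of(
            Reason.CHECKPOINT_DECLINED,
            Reason.CHECKPOINT_DECLINED_TASK_NOT_READY,
            Reason.CHECKPOINT_DECLINED_SUBSUMED,
            Reason.CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED,
            Reason.CHECKPOINT_DECLINED_INPUT_END_OF_STREAM);

    // Walks the cause chain and returns the first throwable of the given type.
    // The depth bound guards against cyclic cause chains.
    static <T extends Throwable> Optional<T> findInChain(Throwable start, Class<T> type) {
        Throwable current = start;
        for (int depth = 0; current != null && depth < 64; depth++) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }

    // True only if a DeclinedException is found and its reason is in DECLINED.
    static boolean isDeclined(Throwable throwable) {
        return findInChain(throwable, DeclinedException.class)
                .map(DeclinedException::getReason)
                .map(DECLINED::contains)
                .orElse(false);
    }
}
```

   Keeping the reasons in an `EnumSet` rather than a `switch` makes it a 
one-line change to add or drop a reason once the open points above are settled.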

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
