[ 
https://issues.apache.org/jira/browse/FLINK-10753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682647#comment-16682647
 ] 

ASF GitHub Bot commented on FLINK-10753:
----------------------------------------

tillrohrmann commented on a change in pull request #7064: [FLINK-10753] Improve 
propagation and logging of snapshot exceptions
URL: https://github.com/apache/flink/pull/7064#discussion_r232468561
 
 

 ##########
 File path: 
flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/AbstractStreamOperator.java
 ##########
 @@ -413,8 +413,11 @@ public final OperatorSnapshotFutures snapshotState(long checkpointId, long times
                                snapshotException.addSuppressed(e);
                        }
 
-                       throw new Exception("Could not complete snapshot " + checkpointId + " for operator " +
-                               getOperatorName() + '.', snapshotException);
+                       String snapshotFailMessage = "Could not complete snapshot " + checkpointId + " for operator " +
+                               getOperatorName() + ".";
+
+                       LOG.info(snapshotFailMessage, snapshotException);
 
 Review comment:
   I think it would be better to log the failure in 
`RpcCheckpointResponder#declineCheckpoint`, because that is where the 
message leaves the `TaskExecutor` and we no longer have control over it. 
Logging there would also cover failures that arrive via the other calling 
paths.
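
A minimal sketch of the idea in this comment: log once at the single point where a decline leaves the process, so that every calling path is covered. `DeclineSink`, `LoggingDeclineSink` and `RecordingDeclineSink` are hypothetical names invented for illustration; the real `RpcCheckpointResponder#declineCheckpoint` signature is not reproduced here, and `java.util.logging` is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical, simplified stand-in for a checkpoint responder (not Flink's interface).
interface DeclineSink {
    void declineCheckpoint(long checkpointId, Throwable cause);
}

// Logs every decline at the one place where it leaves the process, then forwards it.
final class LoggingDeclineSink implements DeclineSink {
    private static final Logger LOG = Logger.getLogger(LoggingDeclineSink.class.getName());

    private final DeclineSink delegate;

    LoggingDeclineSink(DeclineSink delegate) {
        this.delegate = delegate;
    }

    @Override
    public void declineCheckpoint(long checkpointId, Throwable cause) {
        // Passing the Throwable itself (not cause.getMessage()) keeps the
        // stack trace and the full cause chain in the log output.
        LOG.log(Level.INFO, "Declining checkpoint " + checkpointId + ".", cause);
        delegate.declineCheckpoint(checkpointId, cause);
    }
}

// Records forwarded declines so the forwarding behavior can be inspected.
final class RecordingDeclineSink implements DeclineSink {
    final List<Long> declinedIds = new ArrayList<>();
    final List<Throwable> causes = new ArrayList<>();

    @Override
    public void declineCheckpoint(long checkpointId, Throwable cause) {
        declinedIds.add(checkpointId);
        causes.add(cause);
    }
}
```

Because the logging sits in a wrapper around the exit point, callers such as the operator's snapshot path would not need their own log statements under this design.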

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Propagate and log snapshotting exceptions
> -----------------------------------------
>
>                 Key: FLINK-10753
>                 URL: https://issues.apache.org/jira/browse/FLINK-10753
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Alexander Fedulov
>            Assignee: Stefan Richter
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>         Attachments: Screen Shot 2018-11-01 at 16.27.01.png
>
>
> Upon failure, {{AbstractStreamOperator.snapshotState}} rethrows a new 
> exception with the message "{{Could not complete snapshot {} for operator 
> {}.}}" and the original exception as the cause. 
> While handling the error, the {{CheckpointCoordinator.discardCheckpoint}} 
> method logs only this propagated message, not the original cause of the 
> exception.
> In addition, {{pendingCheckpoint.abortDeclined()}}, called from the 
> {{discardCheckpoint}}, reports the failed checkpoint with a misleading 
> message "{{Checkpoint was declined (tasks not ready)}}". This message is what 
> will be displayed in the UI (see attached).
>  Proposal:
>  # Log exception at the Task Manager (.snapshotState)
>  # Log cause, instead of cause.getMessage() at the JobManager 
> (.discardCheckpoint)
>  # Pass root cause to abortDeclined and propagate a more appropriate message 
> to the UI.
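
The second and third points of the proposal can be sketched as follows, assuming a wrapper exception like the one thrown by `snapshotState`. `SnapshotFailureMessages`, `rootCause` and `declineMessage` are hypothetical helpers written for illustration and are not part of Flink; the message format is likewise an assumption.

```java
// Hypothetical helpers illustrating why the cause chain matters (not Flink code).
final class SnapshotFailureMessages {

    private SnapshotFailureMessages() {
    }

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable failure) {
        Throwable current = failure;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Builds a UI-facing decline message that names the root cause instead of
    // a generic text such as "Checkpoint was declined (tasks not ready)".
    static String declineMessage(long checkpointId, Throwable failure) {
        Throwable root = rootCause(failure);
        return "Checkpoint " + checkpointId + " was declined: "
            + root.getClass().getSimpleName() + ": " + root.getMessage();
    }
}
```

With a wrapper whose message is "Could not complete snapshot 42 for operator Map." and whose cause is an `IOException("disk full")`, `getMessage()` on the wrapper returns only the generic text, while `declineMessage(42L, wrapper)` names the `IOException`.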



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
