Alexander Fedulov created FLINK-10753:
-----------------------------------------

             Summary: Propagate and log snapshotting exceptions
                 Key: FLINK-10753
                 URL: https://issues.apache.org/jira/browse/FLINK-10753
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
    Affects Versions: 1.6.2
            Reporter: Alexander Fedulov
         Attachments: Screen Shot 2018-11-01 at 16.27.01.png


Upon failure, {{AbstractStreamOperator.snapshotState}} rethrows a new exception with the message "{{Could not complete snapshot {} for operator {}.}}" and the original exception as the cause.

While handling the error, the {{CheckpointCoordinator.discardCheckpoint}} method logs only this propagated message and not the original cause of the exception.

In addition, {{pendingCheckpoint.abortDeclined()}}, called from {{discardCheckpoint}}, reports the failed checkpoint with a misleading message "{{Checkpoint was declined (tasks not ready)}}". This message is what will be displayed in the UI (see attached).

Proposition:
 # Log the exception at the Task Manager (.snapshotState)
 # Log the cause, instead of cause.getMessage(), at the JobManager (.discardCheckpoint)
 # Pass the root cause to abortDeclined and propagate a more appropriate message to the UI.
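
To illustrate the difference between logging cause.getMessage() and logging the cause itself, here is a minimal, self-contained sketch in plain Java. It does not use any Flink classes: {{failingSnapshot}}, {{messageOnly}}, {{withFullCause}}, {{rootCause}} and {{declineMessage}} are hypothetical helpers, and the "Disk quota exceeded" root cause is an invented example. Only the wrapper message format is taken from the description above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class SnapshotLoggingSketch {

    // Hypothetical stand-in for a failed snapshot: wraps an invented root
    // cause in an exception carrying the message quoted in this issue.
    static Exception failingSnapshot(long checkpointId, String operatorName) {
        Exception root = new IllegalStateException("Disk quota exceeded"); // invented example
        return new Exception(
                "Could not complete snapshot " + checkpointId
                        + " for operator " + operatorName + ".",
                root);
    }

    // Logging only getMessage() drops the original cause.
    static String messageOnly(Throwable cause) {
        return "Discarding checkpoint: " + cause.getMessage();
    }

    // Logging the throwable itself keeps the "Caused by:" chain.
    static String withFullCause(Throwable cause) {
        StringWriter out = new StringWriter();
        cause.printStackTrace(new PrintWriter(out, true));
        return "Discarding checkpoint: " + out;
    }

    // Walks the cause chain down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // One possible (assumed) shape for a more informative decline message.
    static String declineMessage(Throwable cause) {
        Throwable root = rootCause(cause);
        return "Checkpoint was declined: "
                + root.getClass().getSimpleName() + ": " + root.getMessage();
    }

    public static void main(String[] args) {
        Exception failure = failingSnapshot(42L, "map");
        System.out.println(messageOnly(failure));
        System.out.println(withFullCause(failure));
        System.out.println(declineMessage(failure));
    }
}
```

With a logging facade such as SLF4J, the same effect is usually achieved by passing the throwable as the last argument of the log call rather than concatenating its message.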