[jira] [Commented] (FLINK-4809) Operators should tolerate checkpoint failures

ASF GitHub Bot (JIRA) Mon, 20 Nov 2017 01:40:16 -0800

    [ https://issues.apache.org/jira/browse/FLINK-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259021#comment-16259021 ]


ASF GitHub Bot commented on FLINK-4809:
---------------------------------------

Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4883#discussion_r151940619
  
    --- Diff: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java ---
    @@ -1041,27 +1065,27 @@ public void executeCheckpointing() throws Exception {
                                                checkpointMetrics.getAlignmentDurationNanos() / 1_000_000,
                                                checkpointMetrics.getSyncDurationMillis());
                                }
    -                   } finally {
    -                           if (failed) {
    -                                   // Cleanup to release resources
    -                                   for (OperatorSnapshotResult operatorSnapshotResult : operatorSnapshotsInProgress.values()) {
    -                                           if (null != operatorSnapshotResult) {
    -                                                   try {
    -                                                           operatorSnapshotResult.cancel();
    -                                                   } catch (Exception e) {
    -                                                           LOG.warn("Could not properly cancel an operator snapshot result.", e);
    -                                                   }
    +                   } catch (Exception ex) {
    +                           // Cleanup to release resources
    --- End diff --
    
    Because this was moved from the `finally`-block into a `catch`-block, where it is clear that the code failed.
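
    The point about `finally` versus `catch` can be illustrated with a small, self-contained sketch. This is not the code from the pull request: `CleanupPlacementSketch`, `SnapshotHandle`, `cancelAll`, `runWithFinally` and `runWithCatch` are hypothetical names, and rethrowing from the `catch` block is an assumption made only for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CleanupPlacementSketch {

    /** Hypothetical stand-in for a cancellable snapshot result (not Flink's OperatorSnapshotResult). */
    static final class SnapshotHandle {
        boolean cancelled;

        void cancel() {
            cancelled = true;
        }
    }

    /** Cancels every non-null handle, logging instead of failing on a cancel error. */
    static void cancelAll(List<SnapshotHandle> handles) {
        for (SnapshotHandle handle : handles) {
            if (handle != null) {
                try {
                    handle.cancel();
                } catch (RuntimeException e) {
                    System.err.println("Could not properly cancel a snapshot handle: " + e);
                }
            }
        }
    }

    /** Old shape: finally also runs on success, so a flag must record whether the work failed. */
    static void runWithFinally(List<SnapshotHandle> handles, Runnable work) {
        boolean failed = true;
        try {
            work.run();
            failed = false;
        } finally {
            if (failed) {
                cancelAll(handles);
            }
        }
    }

    /** New shape: entering the catch block already means the work failed, so no flag is needed. */
    static void runWithCatch(List<SnapshotHandle> handles, Runnable work) {
        try {
            work.run();
        } catch (RuntimeException ex) {
            cancelAll(handles);
            throw ex; // assumption: the failure is still propagated to the caller
        }
    }

    public static void main(String[] args) {
        List<SnapshotHandle> handles = new ArrayList<>();
        handles.add(new SnapshotHandle());
        try {
            runWithCatch(handles, () -> { throw new IllegalStateException("snapshot failed"); });
        } catch (IllegalStateException expected) {
            System.out.println("cancelled after failure: " + handles.get(0).cancelled);
        }
    }
}
```

    One difference between the two shapes in this sketch: the `finally` variant also cleans up when an `Error` is thrown, whereas a `catch` of an exception type does not.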


> Operators should tolerate checkpoint failures
> ---------------------------------------------
>
>                 Key: FLINK-4809
>                 URL: https://issues.apache.org/jira/browse/FLINK-4809
>             Project: Flink
>          Issue Type: Sub-task
>          Components: State Backends, Checkpointing
>            Reporter: Stephan Ewen
>            Assignee: Stefan Richter
>             Fix For: 1.4.0
>
>
> Operators should try/catch exceptions in the synchronous and asynchronous 
> parts of the checkpoint and send a {{DeclineCheckpoint}} message as a result.
> The decline message should have the failure cause attached to it.
> The checkpoint barrier should be sent anyway as a first step, before 
> attempting to make a state checkpoint, to make sure that downstream operators 
> do not block in alignment.
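
The ordering described in the issue (barrier first, then a guarded snapshot, then a decline carrying the cause) might be sketched as follows. This is only an illustration under stated assumptions: `CheckpointToleranceSketch`, `Downstream`, `Coordinator`, `StateSnapshot` and `performCheckpoint` are hypothetical names, not Flink's interfaces, and only the synchronous part is shown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CheckpointToleranceSketch {

    /** Hypothetical hook for forwarding the checkpoint barrier downstream. */
    interface Downstream {
        void emitBarrier(long checkpointId);
    }

    /** Hypothetical hook for reporting the checkpoint outcome. */
    interface Coordinator {
        void acknowledge(long checkpointId);

        void decline(long checkpointId, Throwable cause);
    }

    /** Hypothetical snapshot action that is allowed to fail. */
    interface StateSnapshot {
        void run() throws Exception;
    }

    static void performCheckpoint(
            long checkpointId,
            Downstream downstream,
            Coordinator coordinator,
            StateSnapshot snapshot) {
        // Step 1: forward the barrier before touching state, so that downstream
        // operators do not block in alignment even if the snapshot fails.
        downstream.emitBarrier(checkpointId);

        // Step 2: take the snapshot; a failure is reported rather than thrown.
        try {
            snapshot.run();
            coordinator.acknowledge(checkpointId);
        } catch (Exception e) {
            // The decline carries the failure cause.
            coordinator.decline(checkpointId, e);
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        Coordinator coordinator = new Coordinator() {
            @Override
            public void acknowledge(long checkpointId) {
                events.add("ack " + checkpointId);
            }

            @Override
            public void decline(long checkpointId, Throwable cause) {
                events.add("decline " + checkpointId + ": " + cause.getMessage());
            }
        };
        performCheckpoint(
                7L,
                id -> events.add("barrier " + id),
                coordinator,
                () -> { throw new IOException("disk full"); });
        System.out.println(events);
    }
}
```

In this sketch the operator itself keeps running after a failed snapshot; the caller only sees a decline with the cause attached.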



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
