[jira] [Closed] (FLINK-17350) StreamTask should always fail immediately on failures in synchronous part of a checkpoint

Piotr Nowojski (Jira) Sat, 16 May 2020 01:13:27 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Piotr Nowojski closed FLINK-17350.
----------------------------------
    Release Note: Failures in the synchronous part of checkpointing (such as an 
exception thrown by an operator) will fail its Task (and job) immediately, 
regardless of the configuration parameters. Since Flink 1.5 such failures could 
be ignored by setting `setTolerableCheckpointFailureNumber(...)` or its 
deprecated `setFailTaskOnCheckpointError(...)` predecessor. Now both options 
will only affect asynchronous failures.
      Resolution: Fixed

Merged to master as 74e3d9f8bb..8ea458137e

I'm reluctant to backport this fix, as:
# it changes the behaviour of checkpointing failures
# most likely this bug is only visible for operators with external state (like 
exactly-once sinks)
# our Kafka sink was only theoretically affected by this: in practice, despite 
ignoring the failure, the sink wouldn't recover and would eventually start 
throwing timeout exceptions
# it's not trivial to backport the fix, as the relevant code has been changing 
over time
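
The rule stated in the release note can be sketched as a small toy model. This is not Flink code: the class, enum, and method names below are hypothetical and chosen only for illustration, and the counting is simplified (the model just counts asynchronous failures and never resets). It only shows the intended decision: a synchronous failure always fails the task, while an asynchronous failure is checked against the tolerable number.

```java
import java.util.Objects;

/**
 * Toy model (NOT Flink code) of the failure-handling rule from the release note.
 * All names are hypothetical and exist only for this illustration.
 */
final class CheckpointFailurePolicy {

    /** The part of the checkpoint in which the failure happened. */
    enum Phase { SYNCHRONOUS, ASYNCHRONOUS }

    /** What the task does in response to the failure. */
    enum Action { FAIL_TASK, TOLERATE }

    private final int tolerableFailures;
    private int asyncFailuresSeen = 0;

    CheckpointFailurePolicy(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Decides how to react to a checkpoint failure in the given phase. */
    Action onFailure(Phase phase) {
        Objects.requireNonNull(phase, "phase");
        if (phase == Phase.SYNCHRONOUS) {
            // Synchronous failures ignore the configuration entirely.
            return Action.FAIL_TASK;
        }
        // Simplification: asynchronous failures are counted and never reset.
        asyncFailuresSeen++;
        return asyncFailuresSeen > tolerableFailures
                ? Action.FAIL_TASK
                : Action.TOLERATE;
    }
}
```

With a tolerable number of 1, the model tolerates the first asynchronous failure and fails the task on the second, but fails the task on any synchronous failure regardless of the setting.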

> StreamTask should always fail immediately on failures in synchronous part of 
> a checkpoint
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-17350
>                 URL: https://issues.apache.org/jira/browse/FLINK-17350
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Task
>    Affects Versions: 1.6.4, 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> This bug also affects the 1.5.x branch.
> As described in point 1 here: 
> https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
> {{setTolerableCheckpointFailureNumber(...)}} and its deprecated 
> {{setFailTaskOnCheckpointError(...)}} predecessor are implemented 
> incorrectly. Since Flink 1.5 
> (https://issues.apache.org/jira/browse/FLINK-4809) they can cause operators 
> (and especially sinks with an external state) to end up in an inconsistent 
> state. That's true even if they are not used, because of another issue: 
> FLINK-17351
> Consider what happens when this is combined with an intermittent external 
> system failure: the sink reports an exception, the transaction is lost or 
> aborted, and the sink is in a failed state. If, by coincidence, the sink 
> then manages to accept further records, the exception can be lost, and all 
> of the records in those failed checkpoints will be lost forever as well.
> For details please check FLINK-17327.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
