[ 
https://issues.apache.org/jira/browse/FLINK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-17350:
-----------------------------------
    Description: 
This bug also affects the 1.5.x branch.

As described in point 1 here: 
https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576

{{setTolerableCheckpointFailureNumber(...)}} and its deprecated 
{{setFailTaskOnCheckpointError(...)}} predecessor are implemented incorrectly. 
Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can 
lead to operators (and especially sinks with external state) ending up in an 
inconsistent state. This is true even if they are not used, because of 
another issue: FLINK-17351
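For context, the affected setting is enabled from user code roughly as follows. This is a sketch against the public {{CheckpointConfig}} API; the checkpoint interval and failure threshold are illustrative, not taken from the ticket:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerantCheckpointsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoint every 10 seconds

        // The affected setting: tolerate up to 3 checkpoint failures before
        // failing the job. Because of this bug, failures in the synchronous
        // part of a checkpoint are also "tolerated", which can leave sinks
        // with external state in an inconsistent state.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

        // ... build and execute the pipeline ...
    }
}
```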

If we combine this with an intermittent external system failure: the sink 
reports an exception, the transaction is lost/aborted, and the sink is in a 
failed state. But if, by a happy coincidence, the sink manages to accept 
further records, this exception can be lost, and all of the records in those 
failed checkpoints will be lost forever as well.
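The failure scenario above can be sketched as a self-contained toy model. This is not Flink code; the class and method names are hypothetical, and it only models why tolerating a failure in the synchronous (pre-commit) part of a checkpoint silently drops a transactional sink's records:

```java
import java.util.ArrayList;
import java.util.List;

public class TolerantCheckpointModel {

    /** Toy transactional sink: records are durable only once pre-commit succeeds. */
    static class TransactionalSink {
        final List<String> committed = new ArrayList<>();
        List<String> openTransaction = new ArrayList<>();
        boolean failed = false;

        void write(String record) {
            // "Happy coincidence": the sink still accepts records after a failure.
            openTransaction.add(record);
        }

        /** Synchronous checkpoint part; throws when the external system misbehaves. */
        void preCommit(boolean externalSystemUp) {
            if (!externalSystemUp) {
                failed = true;
                openTransaction = new ArrayList<>(); // transaction lost/aborted
                throw new RuntimeException("pre-commit failed");
            }
            committed.addAll(openTransaction);
            openTransaction = new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        TransactionalSink sink = new TransactionalSink();

        sink.write("a");
        sink.write("b");
        try {
            sink.preCommit(false); // intermittent external system failure
        } catch (RuntimeException e) {
            // The bug: the failure is "tolerated" instead of failing the job,
            // so no restore from the last successful checkpoint happens and
            // "a" and "b" are never replayed.
        }

        sink.write("c");      // sink keeps accepting records
        sink.preCommit(true); // next checkpoint succeeds

        // "a" and "b" are gone forever: never committed, never replayed.
        System.out.println("committed = " + sink.committed);
    }
}
```

Failing the job on the first pre-commit exception would instead trigger a restore, and "a" and "b" would be replayed from the last checkpoint.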

For details please check FLINK-17327.


  was:
This bug also affects the 1.5.x branch.

As described in point 1 here: 
https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576

{{setTolerableCheckpointFailureNumber(...)}} and its deprecated 
{{setFailTaskOnCheckpointError(...)}} predecessor are implemented incorrectly. 
Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can 
lead to operators (and especially sinks with external state) ending up in an 
inconsistent state. This is true even if they are not used, because of 
another issue: PLACEHOLDER

For details please check FLINK-17327.

The problem boils down to the fact that if an operator/user function throws an 
exception, the job should always fail. There is no recovery from this. In the 
case of {{FlinkKafkaProducer}}, ignoring such failures might mean that the 
whole transaction, with all of its records, will be lost forever.


> StreamTask should always fail immediately on failures in synchronous part of 
> a checkpoint
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-17350
>                 URL: https://issues.apache.org/jira/browse/FLINK-17350
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Task
>    Affects Versions: 1.6.4, 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: Piotr Nowojski
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
