[
https://issues.apache.org/jira/browse/FLINK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski updated FLINK-17350:
-----------------------------------
Description:
This bug also affects the 1.5.x branch.
As described in point 1 here:
https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
{{setTolerableCheckpointFailureNumber(...)}} and its deprecated
{{setFailTaskOnCheckpointError(...)}} predecessor are implemented incorrectly.
Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can
lead to operators (and especially sinks with external state) ending up in an
inconsistent state. This is true even if they are not used, because of
another issue: PLACEHOLDER
For details please check FLINK-17327.
The problem boils down to the fact that if an operator/user function throws an
exception, the job should always fail. There is no recovery from this. In the
case of {{FlinkKafkaProducer}}, ignoring such failures might mean that the
whole transaction, with all of its records, will be lost forever.
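For reference, this is roughly how the affected option is typically enabled on
a job. This is only a minimal sketch: the checkpoint interval, the tolerated
failure count and the trivial pipeline below are assumptions made purely for
illustration, not part of this report.
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableCheckpointFailuresExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing; the 10s interval is an arbitrary value for this sketch.
        env.enableCheckpointing(10_000L);

        // The option discussed above: tolerate up to 3 checkpoint failures before
        // failing the job. As described in this issue, this can also end up swallowing
        // failures from the synchronous part of the checkpoint, which is not safe.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

        // Hypothetical pipeline, only to make the example runnable.
        env.fromElements("a", "b", "c").print();

        env.execute("tolerable-checkpoint-failures-example");
    }
}
{code}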
was:
This bug also affects the 1.5.x branch.
As described here:
https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
{{setTolerableCheckpointFailureNumber(...)}} and its deprecated
{{setFailTaskOnCheckpointError(...)}} predecessor are implemented incorrectly.
Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can
lead to operators (and especially sinks with external state) ending up in an
inconsistent state. This is true even if they are not used, because of
another issue: PLACEHOLDER
For details please check FLINK-17327.
The problem boils down to the fact that if an operator/user function throws an
exception, the job should always fail. There is no recovery from this. In the
case of {{FlinkKafkaProducer}}, ignoring such failures might mean that the
whole transaction, with all of its records, will be lost forever.
> StreamTask should always fail immediately on failures in the synchronous part
> of a checkpoint
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-17350
> URL: https://issues.apache.org/jira/browse/FLINK-17350
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Task
> Affects Versions: 1.6.4, 1.7.2, 1.8.3, 1.9.2, 1.10.0
> Reporter: Piotr Nowojski
> Priority: Critical
>
> This bug also affects the 1.5.x branch.
> As described in point 1 here:
> https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
> {{setTolerableCheckpointFailureNumber(...)}} and its deprecated
> {{setFailTaskOnCheckpointError(...)}} predecessor are implemented
> incorrectly. Since Flink 1.5
> (https://issues.apache.org/jira/browse/FLINK-4809) they can lead to operators
> (and especially sinks with external state) ending up in an inconsistent
> state. This is true even if they are not used, because of another issue:
> PLACEHOLDER
> For details please check FLINK-17327.
> The problem boils down to the fact that if an operator/user function throws
> an exception, the job should always fail. There is no recovery from this. In
> the case of {{FlinkKafkaProducer}}, ignoring such failures might mean that
> the whole transaction, with all of its records, will be lost forever.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)