Roman Khachatryan created FLINK-39738:
-----------------------------------------

             Summary: React to checkpoint ACK RPC failures
                 Key: FLINK-39738
                 URL: https://issues.apache.org/jira/browse/FLINK-39738
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing, Runtime / RPC
    Affects Versions: 2.3.0
            Reporter: Roman Khachatryan
            Assignee: Roman Khachatryan


Flink TaskManagers (TMs) send checkpoint ACK messages with “fire-and-forget” 
semantics: the sender does not wait for, or inspect, the result of the RPC.

As a result, a failure to deliver an ACK (e.g. because the message exceeds 
the Pekko framesize) does not fail the checkpoint. Instead, the checkpoint 
stays pending until it expires at the checkpoint timeout, or until the job is 
restarted for some other reason.

This potentially increases end-to-end (e2e) latency and makes such problems 
difficult to debug and to alert on.
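
To make the difference concrete, here is a minimal, self-contained sketch in plain Java. It is not Flink code: AckSketch, PendingCheckpoint, sendAck and MAX_FRAME_BYTES are hypothetical names, the transport is simulated, and declining the checkpoint is only one possible reaction, which this issue does not prescribe.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class AckSketch {

    enum State { PENDING, ACKNOWLEDGED, DECLINED }

    /** Stand-in for the JobManager-side view of a single checkpoint. */
    static final class PendingCheckpoint {
        State state = State.PENDING;
        String failureReason;
    }

    /** Assumed frame size limit, chosen for illustration only. */
    static final int MAX_FRAME_BYTES = 1024;

    /** Simulated RPC: the ACK only reaches the checkpoint if it fits into one frame. */
    static CompletableFuture<Void> sendAck(PendingCheckpoint cp, int ackBytes) {
        if (ackBytes > MAX_FRAME_BYTES) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("ACK of " + ackBytes + " bytes exceeds the frame size"));
        }
        cp.state = State.ACKNOWLEDGED;
        return CompletableFuture.completedFuture(null);
    }

    /** Fire-and-forget: the returned future is dropped, so a failure goes unnoticed. */
    static void ackFireAndForget(PendingCheckpoint cp, int ackBytes) {
        sendAck(cp, ackBytes);
    }

    /** Reacting variant: a failed send declines the checkpoint right away. */
    static void ackAndReact(PendingCheckpoint cp, int ackBytes) {
        sendAck(cp, ackBytes).whenComplete((ignored, error) -> {
            if (error != null) {
                cp.state = State.DECLINED;
                cp.failureReason = error.getMessage();
            }
        });
    }

    public static void main(String[] args) {
        PendingCheckpoint silent = new PendingCheckpoint();
        ackFireAndForget(silent, 4096);
        // Still PENDING: only a timeout or a job restart would resolve it.
        System.out.println("fire-and-forget: " + silent.state);

        PendingCheckpoint reacted = new PendingCheckpoint();
        ackAndReact(reacted, 4096);
        // DECLINED, with a reason that can be logged and alerted on.
        System.out.println("react: " + reacted.state + " (" + reacted.failureReason + ")");
    }
}
```

In the first case the checkpoint stays pending with no recorded cause; in the second, the failure is visible immediately together with its reason.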



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
