[
https://issues.apache.org/jira/browse/FLINK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weijie Guo updated FLINK-36733:
-------------------------------
Fix Version/s: 2.1.0
(was: 2.0.0)
(was: 1.19.3)
(was: 1.20.2)
> Don't transition task to RUNNING until the inputs are recovered (UC)
> --------------------------------------------------------------------
>
> Key: FLINK-36733
> URL: https://issues.apache.org/jira/browse/FLINK-36733
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.20.0, 1.19.1
> Reporter: Roman Khachatryan
> Assignee: Roman Khachatryan
> Priority: Major
> Fix For: 2.1.0
>
>
> When recovering from an Unaligned Checkpoint, a task transitions to RUNNING
> after restoring:
> # Output channel state
> # Operator state
> # Input channel state
> However, the upstream task(s) might not yet have sent all the recovered
> buffers; therefore, in case of rescaling, the downstream task must keep the
> virtual channel infrastructure up ({{RescalingStreamTaskNetworkInput}}).
> In particular, this means that checkpoints might be triggered by the
> {{CheckpointCoordinator}} but declined by the downstream task (because
> {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
>
> During a long recovery, the many declined checkpoints might exhaust some
> resources, e.g. transaction ID pools in our case.
> It's also confusing (for humans and observability tools) to see tasks
> switched to RUNNING but still unable to checkpoint because of recovery.
>
> The proposal is to transition the task to RUNNING only after all the inputs
> are recovered.
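
The proposal quoted above can be illustrated with a minimal, self-contained sketch. It is not Flink's implementation: the class RecoveryGatedTask, its nested State enum, and the methods onInputRecovered and tryTriggerCheckpoint are hypothetical names invented for this example. It only assumes that the task can learn when each input channel has finished receiving its recovered buffers, and it shows the task staying out of RUNNING (and declining checkpoints) until every input has reported in.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a task's lifecycle; not a Flink class. */
final class RecoveryGatedTask {

    /** Simplified states, named after the ones mentioned in the issue. */
    enum State { INITIALIZING, RUNNING }

    // Indices of input channels whose recovered buffers have not all arrived.
    private final Set<Integer> pendingInputs = new HashSet<>();
    private State state = State.INITIALIZING;

    RecoveryGatedTask(int numberOfInputChannels) {
        for (int i = 0; i < numberOfInputChannels; i++) {
            pendingInputs.add(i);
        }
        // A task without inputs has nothing to wait for.
        maybeSwitchToRunning();
    }

    /** Assumed callback: invoked once per channel when its recovery is done. */
    void onInputRecovered(int channelIndex) {
        pendingInputs.remove(channelIndex);
        maybeSwitchToRunning();
    }

    /** Returns true if the checkpoint is accepted, false if it is declined. */
    boolean tryTriggerCheckpoint(long checkpointId) {
        // While inputs are still recovering, the task cannot checkpoint.
        return state == State.RUNNING;
    }

    State getState() {
        return state;
    }

    private void maybeSwitchToRunning() {
        // The gate: RUNNING is reported only after all inputs are recovered.
        if (state == State.INITIALIZING && pendingInputs.isEmpty()) {
            state = State.RUNNING;
        }
    }
}
```

With this gating, an observer that only triggers checkpoints for tasks in RUNNING would never see a decline caused by ongoing input recovery, because the state itself reflects that recovery has finished.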
--
This message was sent by Atlassian Jira
(v8.20.10#820010)