[
https://issues.apache.org/jira/browse/FLINK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903082#comment-17903082
]
Alexander Fedulov commented on FLINK-36733:
-------------------------------------------
[~roman] I am working on preparing the 1.19.2 and 1.20.1 releases. Do you have
anything in progress that you think we can reasonably get into these patch
releases? Otherwise I would bump the fix versions to 1.19.3 and 1.20.2.
> Don't transition task to RUNNING until the inputs are recovered (UC)
> --------------------------------------------------------------------
>
> Key: FLINK-36733
> URL: https://issues.apache.org/jira/browse/FLINK-36733
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.20.0, 1.19.1
> Reporter: Roman Khachatryan
> Assignee: Roman Khachatryan
> Priority: Major
> Fix For: 1.19.2, 1.20.1
>
>
> When recovering from an Unaligned Checkpoint, a task transitions to RUNNING
> after restoring:
> # Output channel state
> # Operator state
> # Input channel state
> However, the upstream task(s) might not yet have sent all the recovered
> buffers; therefore, in case of rescaling, the downstream task must keep the
> virtual channel infrastructure up ({{RescalingStreamTaskNetworkInput}}).
> That means, in particular, that checkpoints might be triggered by the
> {{CheckpointCoordinator}} but declined by the downstream task (because
> {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
>
> In case of long recovery, many declined checkpoints might exhaust some
> resources, e.g. transaction ID pools in our case.
> It's confusing (for humans and observability tools) to see tasks switched to
> RUNNING but still not able to checkpoint due to recovery.
>
> The proposal is to transition the task to RUNNING only after all the inputs
> are recovered.
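
The proposal quoted above can be illustrated with a small, self-contained
model. This is only a sketch under stated assumptions: the class
RecoveryGateSketch, its nested Task type, and every method name are
hypothetical and are not Flink APIs. The state names simply follow the
wording of the issue. The model tracks how many input channels are still
replaying recovered buffers and switches to RUNNING only when that count
reaches zero, so a checkpoint trigger is never accepted and then declined
during recovery.

```java
// Hypothetical model of the proposed behaviour; none of these names are Flink APIs.
public class RecoveryGateSketch {

    public enum TaskState { INITIALIZING, RUNNING }

    public static final class Task {
        private TaskState state = TaskState.INITIALIZING;
        private int channelsStillRecovering;

        public Task(int inputChannels) {
            this.channelsStillRecovering = inputChannels;
            maybeSwitchToRunning(); // a task without inputs has nothing to wait for
        }

        // Called when one input channel has received all of its recovered buffers.
        public void onChannelRecovered() {
            if (channelsStillRecovering > 0) {
                channelsStillRecovering--;
            }
            maybeSwitchToRunning();
        }

        private void maybeSwitchToRunning() {
            if (channelsStillRecovering == 0) {
                state = TaskState.RUNNING;
            }
        }

        public TaskState state() {
            return state;
        }

        // In this model, checkpoints are only possible once the task is RUNNING.
        public boolean acceptsCheckpoint() {
            return state == TaskState.RUNNING;
        }
    }

    public static void main(String[] args) {
        Task task = new Task(2);
        System.out.println(task.state() + " " + task.acceptsCheckpoint());
        task.onChannelRecovered();
        System.out.println(task.state() + " " + task.acceptsCheckpoint());
        task.onChannelRecovered();
        System.out.println(task.state() + " " + task.acceptsCheckpoint());
    }
}
```

With two input channels, the task in this model stays in INITIALIZING after
the first channel finishes and reports RUNNING only after the second one
does, which matches the behaviour a human or an observability tool would
expect from the RUNNING state.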
--
This message was sent by Atlassian Jira
(v8.20.10#820010)