[ 
https://issues.apache.org/jira/browse/FLINK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo updated FLINK-36733:
-------------------------------
    Fix Version/s: 2.1.0
                       (was: 2.0.0)
                       (was: 1.19.3)
                       (was: 1.20.2)

> Don't transition task to RUNNING until the inputs are recovered (UC)
> --------------------------------------------------------------------
>
>                 Key: FLINK-36733
>                 URL: https://issues.apache.org/jira/browse/FLINK-36733
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 2.1.0
>
>
> When recovering from an Unaligned Checkpoint, a task transitions to RUNNING 
> after restoring:
>  # Output channel state
>  # Operator state
>  # Input channel state 
> However, the upstream task(s) might not yet send all the recovered buffers; 
> therefore, in case of rescaling, downstream task must keep the virtual 
> channel infrastructure up ({{{}RescalingStreamTaskNetworkInput).{}}}
> {{}}
> That means in particular that checkpoints might be triggered by the 
> `CheckpointCoordinator` but declined by the downstream task (because 
> {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
>  
> In case of long recovery, many declined checkpoints might exhaust some 
> resources, e.g. transaction ID pools in our case.
> It's confusing (for humans and observability tools) to see tasks switched to 
> RUNNING but still not able to checkpoint due to recovery.
>  
> The proposal is to transition task to RUNNING only after all the inputs are 
> recovered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to