[
https://issues.apache.org/jira/browse/FLINK-39946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bowen Li reassigned FLINK-39946:
--------------------------------
Assignee: Bowen Li
> Unaligned checkpoint restore can mis-map union input channel state after
> reordering UIDed sources
> -------------------------------------------------------------------------------------------------
>
> Key: FLINK-39946
> URL: https://issues.apache.org/jira/browse/FLINK-39946
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.19.1
> Reporter: Bowen Li
> Assignee: Bowen Li
> Priority: Major
>
> When a Flink job restores from an unaligned checkpoint, input channel state
> is restored by positional input gate/channel indexes rather than by a stable
> edge identity.
> This can break when a job has multiple source operators feeding a union and
> the source declarations are reordered between job versions. Even if the
> source operators have stable UIDs, the generated {{JobEdge}} / input gate
> order can still change because initial source traversal is based on transient
> stream node ordering. After restore, in-flight channel state from one logical
> source may be applied to another logical input gate.
> This is especially problematic for jobs that:
> * use unaligned checkpoints or aligned checkpoints that fall back to
> unaligned,
> * restore from checkpoint/savepoint state containing in-flight channel data,
> * reorder UIDed union inputs, for example by reordering Kafka source
> declarations.
> Expected behavior: if union inputs have stable operator identities, restoring
> channel state should preserve the logical source-to-input-gate mapping across
> source declaration reorders.
> Observed behavior: source operator UIDs remain stable, but union input gate
> positions can change, so channel state is restored to the wrong logical
> input.
> This bug is specific to unaligned-checkpoint channel-state restore; it does
> not affect normal aligned-checkpoint restore.
> Why:
> * Aligned checkpoints do not persist in-flight input channel buffers;
> barriers align first, so there is no channel state to remap on restore.
> * Unaligned checkpoints do persist in-flight channel state, and Flink
> restores that by positional {{gateIdx/inputChannelIdx}}. That is where
> reordered union inputs can restore the wrong logical channel state.
> One caveat: an “aligned” job can still hit it if it uses aligned checkpoint
> timeout and falls back to unaligned. In that case the affected checkpoint
> contains channel state, so the same bug applies.
>
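The mis-mapping described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Flink's actual restore code: the class `ChannelStateRestoreSketch`, the `GateState` holder, the helper methods, and the source UIDs `kafka-orders` / `kafka-payments` are all hypothetical names invented for illustration. It models each union input as one gate with one blob of in-flight data, and compares restoring by gate position with restoring by a stable source UID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model; none of these types are Flink classes.
class ChannelStateRestoreSketch {

    /** In-flight data captured for one input gate, tagged with the UID of the source feeding it. */
    static final class GateState {
        final String sourceUid;
        final String inFlightData;

        GateState(String sourceUid, String inFlightData) {
            this.sourceUid = sourceUid;
            this.inFlightData = inFlightData;
        }
    }

    /** Takes a snapshot: list index == gate index at checkpoint time. */
    static List<GateState> snapshot(List<String> gateOrder) {
        List<GateState> snapshot = new ArrayList<>();
        for (String uid : gateOrder) {
            snapshot.add(new GateState(uid, "buffers-from-" + uid));
        }
        return snapshot;
    }

    /** Positional restore: gate i of the new job receives snapshot entry i, whatever its source. */
    static Map<String, String> restoreByPosition(List<GateState> snapshot, List<String> newGateOrder) {
        Map<String, String> restored = new LinkedHashMap<>();
        for (int gateIdx = 0; gateIdx < newGateOrder.size(); gateIdx++) {
            restored.put(newGateOrder.get(gateIdx), snapshot.get(gateIdx).inFlightData);
        }
        return restored;
    }

    /** Identity-based restore: each gate receives the state recorded for its own source UID. */
    static Map<String, String> restoreByUid(List<GateState> snapshot, List<String> newGateOrder) {
        Map<String, String> stateByUid = new HashMap<>();
        for (GateState state : snapshot) {
            stateByUid.put(state.sourceUid, state.inFlightData);
        }
        Map<String, String> restored = new LinkedHashMap<>();
        for (String uid : newGateOrder) {
            restored.put(uid, stateByUid.get(uid));
        }
        return restored;
    }

    public static void main(String[] args) {
        // Job version 1 declares orders first; version 2 swaps the two declarations.
        List<String> gateOrderV1 = List.of("kafka-orders", "kafka-payments");
        List<String> gateOrderV2 = List.of("kafka-payments", "kafka-orders");

        List<GateState> checkpoint = snapshot(gateOrderV1);

        System.out.println("positional: " + restoreByPosition(checkpoint, gateOrderV2));
        System.out.println("by UID:     " + restoreByUid(checkpoint, gateOrderV2));
    }
}
```

In this model, the positional restore hands the `kafka-orders` buffers to the gate now fed by `kafka-payments` and vice versa, while the UID-keyed restore keeps each source paired with its own in-flight data. That is the expected behavior stated in the issue. How a stable edge identity would actually be derived inside Flink is left open here.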
--
This message was sent by Atlassian Jira
(v8.20.10#820010)