[
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538730#comment-17538730
]
Piotr Nowojski commented on FLINK-27251:
----------------------------------------
Thanks [~fanrui] for the update. I will take a look :)
{quote}
But I don't understand "when that thread is blocked by the timeout, its queue
of requests should be completely empty.", could you share more details? Which
thread? Is it the ChannelStateWriteThread?
{quote}
Yes, I meant the {{ChannelStateWriterThread}}. If we are enqueuing a timeoutable,
but not yet timed out, checkpoint barrier on the outputs, it means that we have
already received AND processed ALL of the checkpoint barriers on the input
channels. In other words, under no circumstances will there be a need to
spill/persist any in-flight data from the outputs for this checkpoint. So if
we are blocking the {{ChannelStateWriterThread}} for this subtask by waiting
for the future (for the checkpoint barrier to either time out on the output or
be sent to the downstream task), this {{ChannelStateWriterThread}} doesn't have
anything else to do. It doesn't matter whether we block it or not. New write
requests to this {{ChannelStateWriterThread}} can only happen for the next
checkpoint, which won't happen until the current checkpoint completes.
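
To make the argument concrete, here is a minimal sketch in plain Java. It is not Flink code: {{BarrierTimeoutSketch}}, {{Outcome}} and {{awaitBarrier}} are hypothetical names, and a single-threaded executor merely stands in for the writer thread. It only shows that a dedicated thread with nothing else queued can block on a future until the barrier is either sent or times out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Flink.
class BarrierTimeoutSketch {

    enum Outcome { SENT_ALIGNED, TIMED_OUT_TO_UNALIGNED }

    // Blocks the calling thread until the barrier has been sent downstream
    // (the future completes) or the aligned timeout elapses.
    static Outcome awaitBarrier(CompletableFuture<Void> barrierSent, long timeoutMillis) {
        try {
            barrierSent.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.SENT_ALIGNED;
        } catch (TimeoutException e) {
            // Assumption: this is the point where output buffers would be persisted.
            return Outcome.TIMED_OUT_TO_UNALIGNED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the writer thread: one thread, no other queued requests.
        ExecutorService writerThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> sentQuickly = CompletableFuture.completedFuture(null);
            CompletableFuture<Void> stuckBehindBackPressure = new CompletableFuture<>();

            Callable<Outcome> fast = () -> awaitBarrier(sentQuickly, 50);
            Callable<Outcome> slow = () -> awaitBarrier(stuckBehindBackPressure, 50);

            System.out.println(writerThread.submit(fast).get());
            System.out.println(writerThread.submit(slow).get());
        } finally {
            writerThread.shutdown();
        }
    }
}
```

Blocking here costs nothing only because, as argued above, no other request can arrive on that thread before the current checkpoint completes.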
> Timeout aligned to unaligned checkpoint barrier in the output buffers of an
> upstream subtask
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-27251
> URL: https://issues.apache.org/jira/browse/FLINK-27251
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.0, 1.15.0
> Reporter: fanrui
> Assignee: fanrui
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when {_}currentTime
> - triggerTime > timeout{_}. But the downstream task still needs to wait for all
> barriers from upstream.
> If the back pressure is severe, the downstream task cannot receive all barriers
> within the CP timeout, which causes the CP to fail.
>
> Can we support upstream Task switching from Aligned to UC? It means that when
> the barrier cannot be sent from the output buffer to the downstream task
> within the
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
> the upstream task switches to UC and takes a snapshot of the data before the
> barrier in the output buffer.
>
> Hi [~akalashnikov], please help take a look in your free time, thanks a lot.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)