[jira] [Commented] (FLINK-27251) Timeout aligned to unaligned checkpoint barrier in the output buffers of an upstream subtask

Piotr Nowojski (Jira) Wed, 18 May 2022 03:53:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538730#comment-17538730
 ]


Piotr Nowojski commented on FLINK-27251:
----------------------------------------

Thanks [~fanrui] for the update. I will take a look :)

{quote}
But I don't understand "when that thread is blocked by the timeout, its queue 
of requests should be completely empty.", could you share more details? Which 
thread? Is it the ChannelStateWriteThread?
{quote}
Yes, I meant the {{ChannelStateWriterThread}}. If we are enqueuing a 
timeoutable, but not yet timed out, checkpoint barrier on the outputs, it means 
that we have already received AND processed ALL of the checkpoint barriers on 
the input channels. In other words, under no circumstances will there be a need 
to spill/persist any in-flight data from the outputs for this checkpoint. So if 
we block the {{ChannelStateWriterThread}} for this subtask by waiting on the 
future (for the checkpoint barrier to either time out on the output or be sent 
to the downstream task), this {{ChannelStateWriterThread}} doesn't have 
anything else to do. It doesn't matter whether we block it or not. New write 
requests to this {{ChannelStateWriterThread}} can only arrive for the next 
checkpoint, which won't happen until the current checkpoint completes.
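
To illustrate the argument above, here is a minimal toy model using only JDK 
classes. It is not Flink code: the class {{WriterThreadBlockingSketch}}, the 
{{Outcome}} enum and both methods are hypothetical names made up for this 
sketch. A single-threaded executor stands in for the writer thread, and a 
{{CompletableFuture}} stands in for "the barrier was sent downstream". The 
writer thread simply blocks on that future with a timeout. Nothing else is 
queued behind it, so blocking costs nothing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model only; all names here are hypothetical and not part of Flink.
class WriterThreadBlockingSketch {

    enum Outcome { BARRIER_SENT_ALIGNED, TIMED_OUT_TO_UNALIGNED }

    // Blocks the calling thread until the barrier leaves the output buffer
    // or the aligned-checkpoint timeout elapses, whichever comes first.
    static Outcome awaitBarrier(CompletableFuture<Void> barrierSent, long timeoutMillis)
            throws InterruptedException {
        try {
            barrierSent.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.BARRIER_SENT_ALIGNED;
        } catch (TimeoutException e) {
            // This is where the output data before the barrier would be persisted.
            return Outcome.TIMED_OUT_TO_UNALIGNED;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // Runs the blocking wait on a dedicated single thread (the stand-in for
    // the writer thread). No other request is submitted while it waits.
    static Outcome runOnWriterThread(boolean barrierLeavesInTime, long timeoutMillis)
            throws Exception {
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> barrierSent = new CompletableFuture<>();
            if (barrierLeavesInTime) {
                barrierSent.complete(null);
            }
            Future<Outcome> result = writer.submit(() -> awaitBarrier(barrierSent, timeoutMillis));
            return result.get();
        } finally {
            writer.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOnWriterThread(true, 50));
        System.out.println(runOnWriterThread(false, 50));
    }
}
```

In this model the only two ways out of the wait are the barrier being sent or 
the timeout firing, which matches the two cases described above.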

> Timeout aligned to unaligned checkpoint barrier in the output buffers of an 
> upstream subtask
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27251
>                 URL: https://issues.apache.org/jira/browse/FLINK-27251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Assignee: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when {_}currentTime 
> - triggerTime > timeout{_}. But the downstream task still needs to wait for all 
> barriers from upstream. 
> If the back pressure is severe, the downstream task cannot receive all barriers 
> within the CP timeout, causing the CP to fail.
>  
> Can we support upstream Task switching from Aligned to UC? It means that when 
> the barrier cannot be sent from the output buffer to the downstream task 
> within the 
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
>  the upstream task switches to UC and takes a snapshot of the data before the 
> barrier in the output buffer.
>  
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
