[ 
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536089#comment-17536089
 ] 

Piotr Nowojski commented on FLINK-27251:
----------------------------------------

Thanks for the answers. Re 3., thanks for the pointers, I missed that. I 
think 1. and 2. could be fixed in the PR. It will be easier to discuss them 
and other issues there.

Re 4. I'm not sure how complicated the FIFO queue solution would be. How many 
writer threads do we have right now? Is there one 
{{ChannelStateWriteRequestExecutor}} per subtask, with each instance having 
its own single thread? If so, maybe your current proposal is actually 
fine? We do not support concurrent unaligned checkpoints, so when that thread 
is blocked by the timeout, its queue of requests should be completely empty.

Maybe one missing thing is support for aborting checkpoints. If a checkpoint 
is being aborted, it would be good to cancel those futures?
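
To illustrate the abort case, here is a minimal sketch of keeping one timeout 
future per checkpoint and cancelling it on abort. The class 
AlignedTimeoutRegistry and its methods are hypothetical names made up for this 
example; they are not Flink's actual API, and the real implementation may look 
quite different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Flink): one pending timeout future per checkpoint. */
final class AlignedTimeoutRegistry implements AutoCloseable {

    // A single timer thread, assumed here for simplicity.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    // checkpointId -> future that fires the aligned-to-unaligned switch.
    private final Map<Long, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();

    /** Schedules the given action to run once the aligned timeout expires. */
    void register(long checkpointId, long timeoutMillis, Runnable switchToUnaligned) {
        ScheduledFuture<?> future =
                timer.schedule(switchToUnaligned, timeoutMillis, TimeUnit.MILLISECONDS);
        pending.put(checkpointId, future);
    }

    /**
     * Called when a checkpoint is aborted (or completes). Returns true only if a
     * timeout that had not yet fired was cancelled.
     */
    boolean abort(long checkpointId) {
        ScheduledFuture<?> future = pending.remove(checkpointId);
        return future != null && future.cancel(false);
    }

    /** Number of checkpoints whose timeout entry has not been removed yet. */
    int pendingCount() {
        return pending.size();
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

With something like this, the abort path would only need to call 
abort(checkpointId), so a stale timeout could not trigger a switch for a 
checkpoint that no longer exists.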

All in all I think +1 for this feature. It looks easier than I thought/feared.

> Timeout aligned to unaligned checkpoint barrier in the output buffers of an 
> upstream subtask
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27251
>                 URL: https://issues.apache.org/jira/browse/FLINK-27251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Priority: Major
>             Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when {_}currentTime 
> - triggerTime > timeout{_}. But the downstream task still needs to wait for all 
> barriers from upstream. 
> If the back pressure is severe, the downstream task cannot receive all barriers 
> within the CP timeout, which causes the CP to fail.
>  
> Can we support upstream Task switching from Aligned to UC? It means that when 
> the barrier cannot be sent from the output buffer to the downstream task 
> within the 
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
>  the upstream task switches to UC and takes a snapshot of the data before the 
> barrier in the output buffer.
>  
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
