[
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534325#comment-17534325
]
Piotr Nowojski commented on FLINK-27251:
----------------------------------------
Thanks for raising the issue [~fanrui]. Yes, this is a known problem. While
developing the unaligned checkpoints, and especially when adding the timeout
support, the timeouts proved very difficult to implement, causing lots of
critical bugs and requiring a lot of effort to debug data corruption and
stabilise the feature. All in all, in retrospect, our feeling was that adding
the timeouts was not worth the effort and that users would be just as well off
using the unaligned checkpoints without any timeout. At one point I was even
thinking about removing the feature altogether in order to simplify the code base.

The main issue with the motivation is that without backpressure unaligned
checkpoints will capture only a negligible amount of in-flight data, and with
backpressure you most likely want fully unaligned checkpoints anyway, so we
don't see a clear benefit in enabling the timeout in the first place. From this
perspective, I would like to first discuss whether we even need this feature.

Secondly, assuming that we really need it, one would have to think very
carefully about how to implement it. Note that if the upstream subtask exceeds
the time limit for sending aligned barriers from its output, then by the time
you want to convert those barriers to an unaligned checkpoint, this subtask has
already completed the checkpoint, while the timeout handling would still have
to append the output in-flight data to that checkpoint.
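
A minimal sketch of that ordering problem, as a toy model rather than Flink
code. The names `UpstreamTimeoutSketch`, `Snapshot`, and
`alignedTimeoutExceeded`, and the rule that a completed snapshot rejects further
data, are assumptions made for illustration. Only the condition
currentTime - triggerTime > timeout comes from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these names are Flink classes. */
class UpstreamTimeoutSketch {

    /** The rule quoted in the issue: currentTime - triggerTime > timeout. */
    static boolean alignedTimeoutExceeded(long triggerTime, long currentTime, long timeout) {
        return currentTime - triggerTime > timeout;
    }

    /** Hypothetical stand-in for one subtask's part of a checkpoint. */
    static final class Snapshot {
        final List<String> persistedInFlight = new ArrayList<>();
        private boolean completed;

        void complete() {
            completed = true;
        }

        /** Assumption of this sketch: a completed snapshot rejects further data. */
        void append(List<String> buffers) {
            if (completed) {
                throw new IllegalStateException("snapshot already completed");
            }
            persistedInFlight.addAll(buffers);
        }
    }

    public static void main(String[] args) {
        long triggerTime = 0L;
        long timeout = 30_000L;
        List<String> outputBuffers = Arrays.asList("buffer-1", "buffer-2");

        // Aligned path: state is snapshotted and the barrier is queued behind
        // the buffered output, so the subtask considers its checkpoint done.
        Snapshot snapshot = new Snapshot();
        snapshot.complete();

        // Later, the barrier is still stuck in the output queue.
        long now = 45_000L;
        if (alignedTimeoutExceeded(triggerTime, now, timeout)) {
            try {
                // Converting to unaligned means persisting the data queued
                // in front of the barrier, but the snapshot is already closed.
                snapshot.append(outputBuffers);
                System.out.println("in-flight data persisted");
            } catch (IllegalStateException e) {
                System.out.println("cannot convert: " + e.getMessage());
            }
        }
    }
}
```

In this model the late conversion fails because the snapshot was closed when
the aligned barrier was emitted. A real implementation would need some way to
keep the upstream subtask's checkpoint open, or to reopen it, until the barrier
has actually left the output buffer.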
> Solve the problem that upstream Task cannot be switched to Unaligned
> Checkpoint
> -------------------------------------------------------------------------------
>
> Key: FLINK-27251
> URL: https://issues.apache.org/jira/browse/FLINK-27251
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.0, 1.15.0
> Reporter: fanrui
> Priority: Major
> Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when {_}currentTime
> - triggerTime > timeout{_}. But the downstream task still needs to wait for all
> barriers from upstream.
> If the back pressure is severe, the downstream task cannot receive all barriers
> within the CP timeout, which causes the CP to fail.
>
> Can we support upstream Task switching from Aligned to UC? It means that when
> the barrier cannot be sent from the output buffer to the downstream task
> within the
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
> the upstream task switches to UC and takes a snapshot of the data before the
> barrier in the output buffer.
>
> Hi [~akalashnikov], please take a look when you have free time, thanks a lot.