[
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537192#comment-17537192
]
fanrui commented on FLINK-27251:
--------------------------------
Hi [~pnowojski],
I have addressed all of your suggestions, including:
* Move this code to {{SubtaskCheckpointCoordinatorImpl#checkpointState}}
* _alignedBarrierTimeout_ should be executed in the task thread
* Support aborting a checkpoint by cancelling every outputBufferFuture
I submitted the [PR|https://github.com/apache/flink/pull/19723]; please help
review it when you have time. It is similar to the POC code.
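To make the last two points easier to picture, here is a minimal Java sketch. It is not the code from the PR: {{AlignedTimeoutSketch}}, {{OutputChannel}}, {{Mailbox}}, {{registerTimeout}} and {{abortCheckpoint}} are hypothetical names, and a plain {{CompletableFuture}} stands in for each timer future. It only shows the intended ordering: a fired timer enqueues the switch instead of performing it, the switch runs when the task thread drains its queue, and aborting the checkpoint cancels every pending timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: none of these types are Flink classes.
final class AlignedTimeoutSketch {

    /** Stand-in for an output channel whose buffer still holds an aligned barrier. */
    static final class OutputChannel {
        boolean barrierUnaligned = false;

        void alignedBarrierTimeout(long checkpointId) {
            barrierUnaligned = true; // the barrier is now treated as unaligned
        }
    }

    /** Stand-in for the task mailbox: any thread may submit, only the task thread drains. */
    static final class Mailbox {
        private final Queue<Runnable> actions = new ConcurrentLinkedQueue<>();

        void submit(Runnable action) {
            actions.add(action);
        }

        void drain() {
            Runnable action;
            while ((action = actions.poll()) != null) {
                action.run();
            }
        }
    }

    private final Mailbox mailbox;
    private final List<CompletableFuture<Void>> pendingTimers = new ArrayList<>();

    AlignedTimeoutSketch(Mailbox mailbox) {
        this.mailbox = mailbox;
    }

    /** Registers a timer; completing the returned future models the timer firing. */
    CompletableFuture<Void> registerTimeout(long checkpointId, OutputChannel channel) {
        CompletableFuture<Void> timer = new CompletableFuture<>();
        // The timer callback only enqueues work; the switch itself runs in the task thread.
        timer.thenRun(() -> mailbox.submit(() -> channel.alignedBarrierTimeout(checkpointId)));
        pendingTimers.add(timer);
        return timer;
    }

    /** Aborting the checkpoint cancels every timer that has not fired yet. */
    void abortCheckpoint() {
        for (CompletableFuture<Void> timer : pendingTimers) {
            timer.cancel(false);
        }
        pendingTimers.clear();
    }

    public static void main(String[] args) {
        Mailbox mailbox = new Mailbox();
        AlignedTimeoutSketch sketch = new AlignedTimeoutSketch(mailbox);

        OutputChannel fired = new OutputChannel();
        sketch.registerTimeout(1L, fired).complete(null); // timer fires
        mailbox.drain();                                  // task thread performs the switch
        System.out.println("checkpoint 1 switched: " + fired.barrierUnaligned);

        OutputChannel aborted = new OutputChannel();
        CompletableFuture<Void> timer = sketch.registerTimeout(2L, aborted);
        sketch.abortCheckpoint();                         // cancels the pending timer
        timer.complete(null);                             // a late firing has no effect
        mailbox.drain();
        System.out.println("checkpoint 2 switched: " + aborted.barrierUnaligned);
    }
}
```

Routing the switch through a queue keeps all channel state confined to one thread, which is the reason for running the timeout in the task thread rather than in the timer callback.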
I also ran CheckpointingTimeBenchmark (from flink-benchmarks) with this PR on my
Mac. I think this PR is very useful when UC is enabled and
[execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout]
is set to a value greater than 0.
These are the benchmark results on my Mac:
{code:java}
Benchmark                                         (mode)       Mode   Cnt    Score    Error  Units
CheckpointingTimeBenchmark.checkpointSingleInput  ALIGNED      thrpt   30    1.402 ±  0.035  ops/s
CheckpointingTimeBenchmark.checkpointSingleInput  UNALIGNED    thrpt   30  401.145 ± 24.741  ops/s
CheckpointingTimeBenchmark.checkpointSingleInput  UNALIGNED_1  thrpt   30  225.498 ±  9.758  ops/s
{code}
Note: because the machine hardware differs, my results deviate somewhat from the
community benchmark results.
The community benchmark results are available at this
[link|http://codespeed.dak8s.net:8000/timeline/#/?exe=1,6&ben=checkpointSingleInput.UNALIGNED_1&env=2&revs=200&equid=off&quarts=on&extr=on].
* Test1 ALIGNED: 1.4 ops/s in the community benchmark, which is close to my result.
* Test2 UNALIGNED: 350 ops/s in the community benchmark versus 401 ops/s on my Mac, about 1.15 times the community result.
* Test3 UNALIGNED_1: 19 ops/s in the community benchmark versus 225 ops/s on my Mac, about 11.8 times the community result.
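For anyone who wants to recheck the ratios above, a small Java sketch follows. The class name {{SpeedupSketch}} and the {{ratio}} helper are made up for illustration; the local scores come from the benchmark table, while the community values (1.4, 350 and 19 ops/s) are the approximate readings quoted in the list, so the ratios are only as precise as those readings.

```java
import java.util.Locale;

// Hypothetical helper for rechecking the ratios; not part of flink-benchmarks.
final class SpeedupSketch {

    /** Ratio of the local score to the community score, both in ops/s. */
    static double ratio(double localOpsPerSec, double communityOpsPerSec) {
        return localOpsPerSec / communityOpsPerSec;
    }

    public static void main(String[] args) {
        // Local scores are from the table; community scores are approximate readings.
        System.out.printf(Locale.ROOT, "ALIGNED:     %.2fx%n", ratio(1.402, 1.4));
        System.out.printf(Locale.ROOT, "UNALIGNED:   %.2fx%n", ratio(401.145, 350.0));
        System.out.printf(Locale.ROOT, "UNALIGNED_1: %.2fx%n", ratio(225.498, 19.0));
    }
}
```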
I guess the improvement in Test2 may be due to the different machine hardware, while
the improvement in Test3 is mainly due to the current PR. We can check the
official improvement after the merge
[here|http://codespeed.dak8s.net:8000/timeline/#/?exe=1,6&ben=checkpointSingleInput.UNALIGNED_1&env=2&revs=200&equid=off&quarts=on&extr=on].
> Timeout aligned to unaligned checkpoint barrier in the output buffers of an
> upstream subtask
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-27251
> URL: https://issues.apache.org/jira/browse/FLINK-27251
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.0, 1.15.0
> Reporter: fanrui
> Assignee: fanrui
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can switch to UC when {_}currentTime
> - triggerTime > timeout{_}. But the downstream task still needs to wait for all
> barriers from upstream.
> If the back pressure is severe, the downstream task cannot receive all barriers
> within the CP timeout, which causes the CP to fail.
>
> Can we support upstream Task switching from Aligned to UC? It means that when
> the barrier cannot be sent from the output buffer to the downstream task
> within the
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
> the upstream task switches to UC and takes a snapshot of the data before the
> barrier in the output buffer.
>
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)