[ 
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537192#comment-17537192
 ] 

fanrui commented on FLINK-27251:
--------------------------------

Hi [~pnowojski]

I have addressed all of your suggestions, including:
 * Move this code to {{SubtaskCheckpointCoordinatorImpl#checkpointState}}
 * _alignedBarrierTimeout_ should be executed in the task thread
 * Support aborting a checkpoint: cancel all outputBufferFuture
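
The three points above can be illustrated with a minimal, self-contained sketch. It is not the code from the PR: every class, method, and field name below is hypothetical, and an explicit {{nowMillis}} argument stands in for a timer that would run in the task thread. The only part taken from the issue is the condition _currentTime - triggerTime > timeout_.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Flink or the PR.
final class BarrierTimeoutSketch {
    enum BarrierState { ALIGNED, UNALIGNED, CANCELLED }

    // One aligned barrier that is still waiting in an output buffer.
    static final class PendingBarrier {
        final long triggerTimeMillis;
        BarrierState state = BarrierState.ALIGNED;

        PendingBarrier(long triggerTimeMillis) {
            this.triggerTimeMillis = triggerTimeMillis;
        }
    }

    private final long alignedTimeoutMillis;
    private final List<PendingBarrier> pending = new ArrayList<>();

    BarrierTimeoutSketch(long alignedTimeoutMillis) {
        this.alignedTimeoutMillis = alignedTimeoutMillis;
    }

    // Called when the barrier is written to an output buffer.
    PendingBarrier emitBarrier(long nowMillis) {
        PendingBarrier barrier = new PendingBarrier(nowMillis);
        pending.add(barrier);
        return barrier;
    }

    // Called when the barrier actually leaves the output buffer in time.
    void onBarrierSent(PendingBarrier barrier) {
        pending.remove(barrier);
    }

    // Timer callback, assumed to run in the (simulated) task thread.
    // Switches every barrier whose wait exceeded the timeout to unaligned.
    void onTimer(long nowMillis) {
        for (PendingBarrier barrier : pending) {
            boolean timedOut = nowMillis - barrier.triggerTimeMillis > alignedTimeoutMillis;
            if (barrier.state == BarrierState.ALIGNED && timedOut) {
                barrier.state = BarrierState.UNALIGNED;
            }
        }
    }

    // Checkpoint abort: cancel everything that has not been switched yet.
    void abortCheckpoint() {
        for (PendingBarrier barrier : pending) {
            if (barrier.state == BarrierState.ALIGNED) {
                barrier.state = BarrierState.CANCELLED;
            }
        }
        pending.clear();
    }
}
```

Keeping all state changes in one thread is what makes the sketch free of locks; the same reasoning is why the timeout callback is expected to run in the task thread.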

I submitted the [PR|https://github.com/apache/flink/pull/19723]; please help 
review it when you have time. It is similar to the POC code.

Also, I ran the CheckpointingTimeBenchmark (flink-benchmarks) with this PR on my 
Mac. I think this PR is very useful when UC is enabled and 
[execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout] is set to a value greater than 0.

 

This is the benchmark result on my Mac:

 
{code:java}
Benchmark                                              (mode)   Mode  Cnt    Score    Error  Units
CheckpointingTimeBenchmark.checkpointSingleInput      ALIGNED  thrpt   30    1.402 ±  0.035  ops/s
CheckpointingTimeBenchmark.checkpointSingleInput    UNALIGNED  thrpt   30  401.145 ± 24.741  ops/s
CheckpointingTimeBenchmark.checkpointSingleInput  UNALIGNED_1  thrpt   30  225.498 ±  9.758  ops/s
{code}
Note: Because the machine hardware differs, my results deviate somewhat from 
the community benchmark results.

 

This [link|http://codespeed.dak8s.net:8000/timeline/#/?exe=1,6&ben=checkpointSingleInput.UNALIGNED_1&env=2&revs=200&equid=off&quarts=on&extr=on] is the community benchmark result. 
 * Test1 ALIGNED: 1.4 ops/s, which is close to my result.
 * Test2 UNALIGNED: 350 ops/s in the community benchmark versus 401 ops/s on 
my Mac, about 1.14 times the community result. 
 * Test3 UNALIGNED_1: 19 ops/s in the community benchmark versus 225 ops/s on 
my Mac, about 11.8 times the community result.
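
The ratios in the list above are the local score divided by the community score. A small sketch of that arithmetic, using the rounded ops/s values quoted in the list (the class and method names are hypothetical):

```java
import java.util.Locale;

// Speedup of a local benchmark score over the community baseline.
// Inputs are the rounded ops/s values quoted in the comment.
final class SpeedupRatio {
    static double ratio(double localOpsPerSec, double communityOpsPerSec) {
        if (communityOpsPerSec <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return localOpsPerSec / communityOpsPerSec;
    }

    public static void main(String[] args) {
        // UNALIGNED: 401 ops/s locally vs 350 ops/s in the community run.
        System.out.printf(Locale.ROOT, "UNALIGNED:   %.2fx%n", ratio(401, 350));
        // UNALIGNED_1: 225 ops/s locally vs 19 ops/s in the community run.
        System.out.printf(Locale.ROOT, "UNALIGNED_1: %.2fx%n", ratio(225, 19));
    }
}
```

Because the two runs used different hardware, these ratios only indicate the order of magnitude of the change, not an exact speedup.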

I guess the improvement in test2 is due to the different machine hardware, 
while the improvement in test3 is mainly due to the current PR. Once the PR is 
merged, we can see the official improvement 
[here|http://codespeed.dak8s.net:8000/timeline/#/?exe=1,6&ben=checkpointSingleInput.UNALIGNED_1&env=2&revs=200&equid=off&quarts=on&extr=on].

 

> Timeout aligned to unaligned checkpoint barrier in the output buffers of an 
> upstream subtask
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27251
>                 URL: https://issues.apache.org/jira/browse/FLINK-27251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Assignee: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can switch to UC when {_}currentTime 
> - triggerTime > timeout{_}. But the downstream task still needs to wait for 
> all barriers from upstream. 
> If the back pressure is severe, the downstream task cannot receive all 
> barriers within the CP timeout, which causes the CP to fail.
>  
> Can we support upstream Task switching from Aligned to UC? It means that when 
> the barrier cannot be sent from the output buffer to the downstream task 
> within the 
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
>  the upstream task switches to UC and takes a snapshot of the data before the 
> barrier in the output buffer.
>  
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
