[ 
https://issues.apache.org/jira/browse/FLINK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333250#comment-17333250
 ] 

Anton Kalashnikov commented on FLINK-22137:
-------------------------------------------

I am just duplicating the comment from the parent ticket:

 

I checked the suggested scenarios and didn't find any problem that is 
specific to unaligned checkpoints.

Settings:

*Cluster*: Amazon EMR (4 instances: 4 vCore, 16 GiB memory)
*Cluster run*: ./bin/yarn-session.sh --detached
*Job for testing*: DataStreamAllroundTestProgram and the simpler 
TopSpeedWindowing.
*Checkpoint*: unaligned
*Job arguments*: --environment.externalize_checkpoint true 
--environment.parallelism 2 --state_backend.checkpoint_directory 
s3://anton-flink-test/checkpoints  --state_backend rocks
*Parallelism*: 1 to 9 (for reference, DataStreamAllroundTestProgram has 9 
tasks, so at most 9 * 9 = 81 subtasks)
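
The subtask arithmetic for the parallelism range can be sketched as below. This is only an illustration of the 9 * 9 = 81 upper bound; the function and variable names are hypothetical, and it assumes every task runs at the full job parallelism.

```python
# Task count of DataStreamAllroundTestProgram, as stated in the settings.
TASKS_IN_JOB = 9


def total_subtasks(parallelism: int, tasks: int = TASKS_IN_JOB) -> int:
    """Upper bound on subtasks, assuming every task runs at `parallelism`."""
    return tasks * parallelism


# Parallelism values 1 through 9, matching the tested range.
subtasks_by_parallelism = {p: total_subtasks(p) for p in range(1, 10)}

for p, n in subtasks_by_parallelism.items():
    print(f"parallelism={p}: up to {n} subtasks")
```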

 

One small note: when hashmap is used as the state backend, a lot of problems 
appear, for example OOM or network issues (timeouts), but they can be 
observed with both aligned and unaligned checkpoints. So again, I didn't find 
any problems specific to unaligned checkpoints.

> Execute unaligned checkpoint test on a cluster
> ----------------------------------------------
>
>                 Key: FLINK-22137
>                 URL: https://issues.apache.org/jira/browse/FLINK-22137
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Arvid Heise
>            Priority: Major
>
> Start application and at some point cancel/induce failure, the user needs to 
> restart from a retained checkpoint with
> *     lower
> *     same
> *     higher degree of parallelism.
> To enable unaligned checkpoints, set
> *     execution.checkpointing.unaligned: true
> *     execution.checkpointing.alignment-timeout to 0s, 10s, 1min (for high 
> backpressure)
> The primary objective is to check if all data is recovered properly and if 
> the semantics is correct (does state match input?).
> The secondary objective is to check if Flink UI shows the information 
> correctly:
> *     unaligned checkpoint enabled on job level
> *     timeout on job level
> *     for each checkpoint, if it's unaligned or not; how much data was written
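
The scenarios in the description above (restart with lower, same, or higher parallelism, combined with the three alignment timeouts) form a small test matrix. The following is a minimal sketch of how it could be enumerated. The two configuration keys are taken from the description; the helper names, the initial parallelism of 2 (borrowed from the job arguments), and the halving/doubling rule for rescaling are assumptions made for illustration.

```python
from itertools import product

# Values taken from the issue description.
ALIGNMENT_TIMEOUTS = ["0s", "10s", "1min"]
RESCALE_DIRECTIONS = ["lower", "same", "higher"]


def restore_parallelism(initial: int, direction: str) -> int:
    """Hypothetical rescaling rule: halve, keep, or double the parallelism."""
    if direction == "lower":
        return max(1, initial // 2)
    if direction == "higher":
        return initial * 2
    return initial


def build_test_matrix(initial_parallelism: int = 2) -> list:
    """One entry per (rescale direction, alignment timeout) combination."""
    matrix = []
    for direction, timeout in product(RESCALE_DIRECTIONS, ALIGNMENT_TIMEOUTS):
        matrix.append({
            "direction": direction,
            "initial_parallelism": initial_parallelism,
            "restore_parallelism": restore_parallelism(initial_parallelism, direction),
            # Configuration keys as named in the issue description.
            "config": {
                "execution.checkpointing.unaligned": "true",
                "execution.checkpointing.alignment-timeout": timeout,
            },
        })
    return matrix


test_matrix = build_test_matrix()
for case in test_matrix:
    print(case)
```

Each entry describes one run: start the job at the initial parallelism, cancel it or induce a failure, then restart from the retained checkpoint at the restore parallelism.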



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
