[ 
https://issues.apache.org/jira/browse/FLINK-26882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514214#comment-17514214
 ] 

Anton Kalashnikov commented on FLINK-26882:
-------------------------------------------

Some conclusions:

First of all, this test did not recently start failing; it has never worked 
at all (it was broken both before and after FLINK-26789). Since this is not a 
regression, we can simply revert the commits or ignore the test 
(https://github.com/apache/flink/pull/19271). [~pnowojski], [~gaoyunhaii]?

Secondly, the test fails because it validates the state incorrectly. 
More precisely, the static variable *CollectionSink#elements* collects all 
values, and the test assumes that all of them are covered by the checkpoint, 
so that none of them is seen again after the restore. That assumption does not 
hold, because the test gives no guarantee that every value in 
*CollectionSink#elements* has been checkpointed. If the Flink job is cancelled 
during the last checkpoint, the previous checkpoint is used for recovery. With 
unaligned checkpoints it contains in-flight data, and as a result the last 
several records are emitted again.
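
To make the failure mode concrete, here is a minimal plain-Java sketch of it. 
It is not the real test and uses no Flink API: the class name, the method 
names and the offsets are made up for illustration, and a checkpoint is 
reduced to a single source offset, which ignores how unaligned checkpoints 
actually store in-flight data. It only shows that a static collection that 
survives the restart sees the records between the last completed checkpoint 
and the cancellation point twice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplayAfterRestoreSketch {
    /** Hypothetical stand-in for a static sink field; it survives the simulated restart. */
    static final List<Integer> ELEMENTS = new ArrayList<>();

    /** Emits the records [from, to) into the shared static collection. */
    static void run(int from, int to) {
        for (int i = from; i < to; i++) {
            ELEMENTS.add(i);
        }
    }

    /** Returns every occurrence of a record after its first, in arrival order. */
    static List<Integer> duplicates(List<Integer> records) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> dups = new ArrayList<>();
        for (int r : records) {
            if (!seen.add(r)) {
                dups.add(r);
            }
        }
        return dups;
    }

    /**
     * First run emits [0, cancelOffset) and is then cancelled. The second run
     * resumes from the offset of the last completed checkpoint and emits up to total.
     */
    static List<Integer> simulate(int lastCompletedCheckpointOffset, int cancelOffset, int total) {
        ELEMENTS.clear();
        run(0, cancelOffset);
        run(lastCompletedCheckpointOffset, total);
        return duplicates(ELEMENTS);
    }

    public static void main(String[] args) {
        // Last completed checkpoint covers 7 records, job is cancelled after 10.
        System.out.println(simulate(7, 10, 12));  // prints [7, 8, 9]
        // Checkpoint covers everything that was emitted: nothing is replayed.
        System.out.println(simulate(10, 10, 12)); // prints []
    }
}
```

In this sketch the check "no value is seen twice" only holds in the second 
case, where the checkpoint happens to cover every emitted record, and that is 
exactly the guarantee the test does not provide.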

Finally, I am concerned about the correctness of this test in general. I 
don't really understand what it is trying to check, since the job doesn't use 
any state from the recovery. To me, the test looks like this: 
* wait until all data is processed
* the checkpoint stores nothing (because all data has been processed)
* restore from the empty checkpoint with a different parallelism
* check that we can process new data (totally independent of the first run)

Does it really make sense?

Since I don't fully understand the purpose of this test, I would like to ask 
[~yunta] or [~Yanfei Lei] to think about how to fix it, or to give me more 
details about its purpose. I recently created the test 
*RestoreUpgradedJobITCase*, which also checks the correctness of different 
states after restoring from different snapshots. So maybe we can adapt my 
test to cover different parallelism (if the idea behind the check is the same).

> Unaligned checkpoint with 0s timeout would fail 
> RescaleCheckpointManuallyITCase
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-26882
>                 URL: https://issues.apache.org/jira/browse/FLINK-26882
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>            Reporter: Yun Tang
>            Assignee: Anton Kalashnikov
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> Once we set {{execution.checkpointing.unaligned: true}} and 
> {{execution.checkpointing.alignment-timeout: PT0S}}, 
> RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState 
> fails.
> Broken instances:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33776&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5623
>  
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5626
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5629
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9&l=5733
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a549b384-c55a-52c0-c451-00e0477ab6db&t=eef5922c-08d9-5ba3-7299-8393476594e7&l=12575
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=b78d9d30-509a-5cea-1fef-db7abaa325ae&l=5838
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=b0a398c0-685b-599c-eb57-c8c2a771138e&t=747432ad-a576-5911-1e2a-68c6bedc248a&l=12931
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca&l=5682



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
