[ 
https://issues.apache.org/jira/browse/FLINK-26882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514452#comment-17514452
 ] 

Yun Tang commented on FLINK-26882:
----------------------------------

First of all, thanks very much to [~akalashnikov] for the great work. It is my 
fault that {{RescaleCheckpointManuallyITCase}} was not stable.

I agree that the root cause of the instability is that the current test case 
does not guarantee that all of the state content is checkpointed. 
However, I still want to add my two cents to clarify what the test does. If you 
look at the comments in {{RescaleCheckpointManuallyITCase}}, you can 
see that the test refers to {{RescalingITCase}}, and its main logic is almost 
the same as {{RescalingITCase#testSavepointRescalingKeyedState}}. 
In fact, the current {{RescaleCheckpointManuallyITCase}} can verify that keyed 
state has been correctly restored after rescaling. You can refer to the 
class {{#SubtaskIndexFlatMapper}} to see that there exist two value-states named 
{{counter}} and {{sum}}. These two value-states are the main targets to verify. 
Once restored, the previous {{counter}} and {{sum}} are picked up again, and 
that is why we think the expected elements in the 2nd run would be 
[numberElements + 
numberElements2|https://github.com/apache/flink/blob/4c8995917885e301ca11023fb5e4eb3d0b7a0c7e/flink-tests/src/test/java/org/apache/flink/test/checkpointing/RescaleCheckpointManuallyITCase.java#L152]

In other words, the 2nd run does not restore from an empty checkpoint, and we 
do rely on the states within class {{#SubtaskIndexFlatMapper}} to verify that 
the checkpoint is restored as expected.
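
To make the {{numberElements + numberElements2}} expectation concrete, here is a 
minimal plain-Java sketch. It is not the Flink test code: the class 
{{KeyedStateRestoreSketch}} and its methods are hypothetical, and a {{HashMap}} 
copy stands in for a checkpoint. It only shows that a run restored from a 
non-empty checkpoint continues from the previous {{counter}} and {{sum}}.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-key value-state surviving a restore. Not Flink code. */
class KeyedStateRestoreSketch {

    /** Per-key state, named after the two value-states in the comment. */
    static final class KeyState {
        int counter;
        int sum;

        KeyState(int counter, int sum) {
            this.counter = counter;
            this.sum = sum;
        }
    }

    /** One job run: processes records and keeps state per key. */
    static final class Job {
        final Map<Integer, KeyState> state = new HashMap<>();

        void process(int key, int value) {
            KeyState s = state.computeIfAbsent(key, k -> new KeyState(0, 0));
            s.counter += 1;
            s.sum += value;
        }

        /** Deep copy, standing in for a checkpoint of all keyed state. */
        Map<Integer, KeyState> snapshot() {
            Map<Integer, KeyState> copy = new HashMap<>();
            state.forEach((k, s) -> copy.put(k, new KeyState(s.counter, s.sum)));
            return copy;
        }

        /** Starts a new run from the given (possibly empty) checkpoint. */
        static Job restore(Map<Integer, KeyState> checkpoint) {
            Job job = new Job();
            checkpoint.forEach((k, s) -> job.state.put(k, new KeyState(s.counter, s.sum)));
            return job;
        }
    }

    /** Runs two jobs back to back and returns the final counter for the key. */
    static int counterAfterTwoRuns(
            int key, int numberElements, int numberElements2, boolean restoreFromCheckpoint) {
        Job first = new Job();
        for (int i = 0; i < numberElements; i++) {
            first.process(key, i);
        }
        Map<Integer, KeyState> checkpoint =
                restoreFromCheckpoint ? first.snapshot() : new HashMap<>();
        Job second = Job.restore(checkpoint);
        for (int i = 0; i < numberElements2; i++) {
            second.process(key, i);
        }
        return second.state.get(key).counter;
    }

    public static void main(String[] args) {
        // Restored run continues counting: 1000 + 500.
        System.out.println(counterAfterTwoRuns(7, 1000, 500, true));  // prints 1500
        // A run restored from an empty checkpoint only sees the 2nd batch.
        System.out.println(counterAfterTwoRuns(7, 1000, 500, false)); // prints 500
    }
}
```

If the 2nd run had restored from an empty checkpoint, the counter would stop at 
{{numberElements2}}, so reaching the combined count is only possible when the 
state was actually restored.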

So the next question is how to guarantee that all states within 
{{#SubtaskIndexFlatMapper}} are included in the next checkpoint. If the 
2nd job restores from a checkpoint that is triggered after we call 
{{CollectionSink#getElementsSet()}}, then we can consider that checkpoint a 
safe one. And thanks to FLINK-24280, we can now trigger a manual checkpoint in a 
mini cluster.
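
The ordering argument can be sketched in plain Java, without any Flink APIs. 
Everything here is an assumption for illustration: the class 
{{SafeCheckpointSketch}} is hypothetical, a {{CountDownLatch}} stands in for 
waiting until the sink has seen all expected elements, and reading an 
{{AtomicInteger}} stands in for the manual checkpoint trigger.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model: snapshot only after every expected result has been observed. */
class SafeCheckpointSketch {

    /** Returns the state captured by a snapshot taken after all results are visible. */
    static int snapshotAfterAllResults(int numberElements) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger(); // stands in for keyed state
        CountDownLatch allResultsSeen = new CountDownLatch(numberElements);

        Thread worker = new Thread(() -> {
            for (int i = 0; i < numberElements; i++) {
                processed.incrementAndGet();  // state is updated first
                allResultsSeen.countDown();   // then the result becomes visible
            }
        });
        worker.start();

        // Stands in for waiting until the sink contains all expected elements.
        allResultsSeen.await();
        // Stands in for the manually triggered checkpoint.
        int snapshot = processed.get();

        worker.join();
        return snapshot;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(snapshotAfterAllResults(1000)); // prints 1000
    }
}
```

Because each state update happens before the matching result becomes visible, a 
snapshot taken after the wait cannot miss any update. A periodic checkpoint that 
fires at an arbitrary time gives no such guarantee.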

The last question is whether we can adopt {{RestoreUpgradedJobITCase}} to check 
the correctness of different snapshots. Unfortunately, it cannot satisfy our 
requirement, as we want to verify the correctness of RocksDB keyed state rescaling 
while {{RestoreUpgradedJobITCase}} only includes operator state. The reason why 
we introduced {{RescaleCheckpointManuallyITCase}} is that we improved the 
performance of RocksDB rescaling by leveraging its {{deleteRange}} API (in 
FLINK-21321), which helps a lot in reactive mode during rescaling, and 
Flink currently lacks an IT case to verify checkpoint rescaling.
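
As a loose analogy for the range-delete idea, the sketch below clips a sorted 
map to a target key-group range by clearing the two ranges outside it, instead 
of iterating and deleting key by key. This is not RocksDB or Flink code: the 
class {{RangeClipSketch}} is hypothetical, a {{TreeMap}} stands in for the 
sorted store, and integer keys stand in for key-group-prefixed keys.

```java
import java.util.TreeMap;

/** Toy model of clipping restored state to a key-group range. Not RocksDB code. */
class RangeClipSketch {

    /** Returns a copy that keeps only keys in [startInclusive, endInclusive]. */
    static TreeMap<Integer, String> clipToRange(
            TreeMap<Integer, String> restored, int startInclusive, int endInclusive) {
        TreeMap<Integer, String> db = new TreeMap<>(restored);
        // Two range deletes: everything below the range and everything above it.
        db.headMap(startInclusive, false).clear();
        db.tailMap(endInclusive, false).clear();
        return db;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> restored = new TreeMap<>();
        for (int keyGroup = 0; keyGroup < 10; keyGroup++) {
            restored.put(keyGroup, "state-" + keyGroup);
        }
        // A subtask that now owns key groups 3..6 drops the rest.
        System.out.println(clipToRange(restored, 3, 6).keySet()); // prints [3, 4, 5, 6]
    }
}
```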

I noticed that Anton had created a 
[PR|https://github.com/apache/flink/pull/19271] to ignore this test. I am 
very sorry for this unstable test. Could [~akalashnikov] also spend 
some time to take a look at [my 
fix|https://github.com/apache/flink/pull/19276]? It is very easy to 
understand and can also be verified in a local environment.

Finally, I just want to thank [~akalashnikov] once again for the great work in 
figuring out the cause of the instability.




> Unaligned checkpoint with 0s timeout would fail 
> RescaleCheckpointManuallyITCase
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-26882
>                 URL: https://issues.apache.org/jira/browse/FLINK-26882
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>            Reporter: Yun Tang
>            Assignee: Anton Kalashnikov
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> Once we set {{execution.checkpointing.unaligned: true}} and 
> {{execution.checkpointing.alignment-timeout: PT0S}}, 
> RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState 
> fails.
> Broken instances:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33776&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5623
>  
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5626
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5629
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9&l=5733
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a549b384-c55a-52c0-c451-00e0477ab6db&t=eef5922c-08d9-5ba3-7299-8393476594e7&l=12575
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=b78d9d30-509a-5cea-1fef-db7abaa325ae&l=5838
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=b0a398c0-685b-599c-eb57-c8c2a771138e&t=747432ad-a576-5911-1e2a-68c6bedc248a&l=12931
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca&l=5682



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
