[
https://issues.apache.org/jira/browse/FLINK-26882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514452#comment-17514452
]
Yun Tang commented on FLINK-26882:
----------------------------------
First of all, thanks very much for [~akalashnikov]'s great work. It is my fault
that {{RescaleCheckpointManuallyITCase}} is not stable.
I agree that the root cause of the instability is that the current test case
does not guarantee that all state content has been checkpointed.
However, I would still like to add my two cents to clarify what the test
actually does. If you look at the comments in
{{RescaleCheckpointManuallyITCase}}, you can see that the test refers to
{{RescalingITCase}}, and that its main logic is almost the same as
{{RescalingITCase#testSavepointRescalingKeyedState}}.
Actually, the current {{RescaleCheckpointManuallyITCase}} can verify that the
keyed state has been correctly restored after rescaling. If you look at the
class {{#SubtaskIndexFlatMapper}}, you can see that there exist two
value-states named {{counter}} and {{sum}}. These two value-states are the main
targets of the verification.
Once restored, the previous {{counter}} and {{sum}} are picked up again, and
that is why we think the expected elements in the 2nd rescale would be
[numberElements +
numberElements2|https://github.com/apache/flink/blob/4c8995917885e301ca11023fb5e4eb3d0b7a0c7e/flink-tests/src/test/java/org/apache/flink/test/checkpointing/RescaleCheckpointManuallyITCase.java#L152].
In other words, the 2nd run does not restore from an empty checkpoint, and we
do rely on the states within class {{#SubtaskIndexFlatMapper}} to verify that
the checkpoint is restored as expected.
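To make the expectation above concrete, here is a minimal plain-Java sketch of
the idea. It is not the Flink API and not the code of the test: the class
{{RescaleModel}}, the modulo-based key routing and the method names are
hypothetical stand-ins, chosen only to show why a restored {{counter}}
continues from the previous run even when the parallelism changes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model (not Flink API): per-key {counter, sum} spread over subtasks. */
final class RescaleModel {
    // One map per subtask; each value is a two-slot array {counter, sum}.
    private final List<Map<Integer, long[]>> subtasks = new ArrayList<>();

    RescaleModel(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            subtasks.add(new HashMap<>());
        }
    }

    // Assumption: keys are routed by a simple modulo, standing in for key groups.
    private Map<Integer, long[]> ownerOf(int key) {
        return subtasks.get(Math.floorMod(key, subtasks.size()));
    }

    /** Processes one element: increments the counter and adds the value to the sum. */
    void process(int key, long value) {
        long[] state = ownerOf(key).computeIfAbsent(key, k -> new long[2]);
        state[0]++;
        state[1] += value;
    }

    /** Models "checkpoint, then restore with a different parallelism". */
    RescaleModel restoreWithParallelism(int newParallelism) {
        RescaleModel restored = new RescaleModel(newParallelism);
        for (Map<Integer, long[]> subtask : subtasks) {
            for (Map.Entry<Integer, long[]> entry : subtask.entrySet()) {
                int key = entry.getKey();
                restored.ownerOf(key).put(key, entry.getValue().clone());
            }
        }
        return restored;
    }

    long counter(int key) {
        long[] state = ownerOf(key).get(key);
        return state == null ? 0L : state[0];
    }

    /** First run with parallelism 2, second run restored with parallelism 3. */
    static long counterAfterTwoRuns(int numberElements, int numberElements2) {
        RescaleModel first = new RescaleModel(2);
        for (int i = 0; i < numberElements; i++) {
            first.process(7, i);
        }
        RescaleModel second = first.restoreWithParallelism(3);
        for (int i = 0; i < numberElements2; i++) {
            second.process(7, i);
        }
        return second.counter(7);
    }

    public static void main(String[] args) {
        System.out.println(counterAfterTwoRuns(1000, 500)); // prints 1500
    }
}
```

If the second run had started from empty state, the counter in this model would
be only {{numberElements2}}, which is the difference the test relies on.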
So the next question is how to guarantee that all states within
{{#SubtaskIndexFlatMapper}} are included in the checkpoint we restore from. If
the 2nd job restores from a checkpoint that is triggered after we call
{{CollectionSink#getElementsSet()}}, then we can say that this checkpoint is a
safe one. And thanks to FLINK-24280, we can now trigger a checkpoint manually
in a mini cluster.
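The ordering argument can be sketched with a small single-threaded model. This
is only an illustration under simplifying assumptions: {{SafeCheckpointModel}}
and its methods are hypothetical names, the "checkpoint" is just a copy of a
counter, and the real test would instead use the manual checkpoint trigger of
the mini cluster from FLINK-24280.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Flink API) of when a checkpoint contains all state. */
final class SafeCheckpointModel {
    private long counter;                              // stands in for operator state
    private final List<Long> sink = new ArrayList<>(); // what the test can observe

    /** State is updated first, then the element becomes visible in the sink. */
    void process(long value) {
        counter++;
        sink.add(value);
    }

    long checkpoint() {
        return counter;
    }

    /** Checkpoint taken at an arbitrary point of the run: later updates are missing. */
    static long earlyCheckpoint(int numberElements, int takenAfter) {
        SafeCheckpointModel job = new SafeCheckpointModel();
        long snapshot = 0L;
        for (int i = 0; i < numberElements; i++) {
            job.process(i);
            if (i + 1 == takenAfter) {
                snapshot = job.checkpoint();
            }
        }
        return snapshot;
    }

    /** Checkpoint triggered only after the sink holds every expected element. */
    static long checkpointAfterSinkComplete(int numberElements) {
        SafeCheckpointModel job = new SafeCheckpointModel();
        for (int i = 0; i < numberElements; i++) {
            job.process(i);
        }
        if (job.sink.size() != numberElements) {
            throw new IllegalStateException("sink has not received all elements yet");
        }
        return job.checkpoint();
    }

    public static void main(String[] args) {
        System.out.println(earlyCheckpoint(1000, 400));        // prints 400
        System.out.println(checkpointAfterSinkComplete(1000)); // prints 1000
    }
}
```

In this model only the second variant captures the full state, which mirrors
the requirement that the checkpoint be triggered after the sink has been
checked.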
The last question is whether we can adopt {{RestoreUpgradedJobITCase}} to check
the correctness of different snapshots. Unfortunately, it cannot satisfy our
needs, because we want to verify the correctness of RocksDB keyed state rescale
while {{RestoreUpgradedJobITCase}} only includes operator state. The reason why
we introduced {{RescaleCheckpointManuallyITCase}} is that we improved the
performance of RocksDB rescale by leveraging its {{deleteRange}} API (in
FLINK-21321), which helps reactive mode a lot during rescaling, and
Flink currently lacks an IT case to verify checkpoint rescale.
I noticed that Anton had created a
[PR|https://github.com/apache/flink/pull/19271] to ignore this test. I am very
sorry for this unstable test. Could [~akalashnikov] also spend some time taking
a look at [my fix
solution|https://github.com/apache/flink/pull/19276]? It is very easy to
understand and can also be verified in a local environment.
Finally, I just want to thank [~akalashnikov] once again for the great work in
figuring out the cause of the instability.
> Unaligned checkpoint with 0s timeout would fail
> RescaleCheckpointManuallyITCase
> -------------------------------------------------------------------------------
>
> Key: FLINK-26882
> URL: https://issues.apache.org/jira/browse/FLINK-26882
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Yun Tang
> Assignee: Anton Kalashnikov
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> Once we set {{execution.checkpointing.unaligned: true}} and
> {{execution.checkpointing.alignment-timeout: PT0S}},
> RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState
> fails.
> Broken instances:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33776&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5623
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5626
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33787&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5629
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=12409
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=baf26b34-3c6a-54e8-f93f-cf269b32f802&t=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9&l=5733
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=a549b384-c55a-52c0-c451-00e0477ab6db&t=eef5922c-08d9-5ba3-7299-8393476594e7&l=12575
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=b78d9d30-509a-5cea-1fef-db7abaa325ae&l=5838
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=b0a398c0-685b-599c-eb57-c8c2a771138e&t=747432ad-a576-5911-1e2a-68c6bedc248a&l=12931
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=33779&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca&l=5682