[ 
https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882674#comment-17882674
 ] 

Matthias Pohl edited comment on FLINK-36295 at 9/18/24 11:48 AM:
-----------------------------------------------------------------

Several things were identified during the investigation so far:
* The test itself can have different behavior (with the same outcome) depending 
on whether the rescale is triggered by the StateTransitionManager or the 
failure due to missing TM slots. The test should be hardened accordingly to 
ensure consistent behavior (waiting for desired resources to be met to allow 
for the TM removal)
* The final wait in the test (see [line 
202|https://github.com/apache/flink/blob/db682b9869dcedce1873c6ee5252b0e351373a69/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerClusterITCase.java#L202])
 can cause inconsistent behavior if the job ran with sufficient resources (i.e. 
2 slots allocated) in the first place. In that case, the wait wouldn't delay 
the test code execution (this was actually the case for failed test run and 
might contribute to the behavior I summarize in the last item in this list)
* (/) [[Reasonsing 
below|https://issues.apache.org/jira/browse/FLINK-36295?focusedCommentId=17882688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17882688]]
 There is a strange observation in the logs where the StateTransitionManager is 
transitioning into Stabilizing phase with "Desired resources are not met" log 
message indicating that there are not enough resources available for immediate 
state transitioning. But the 4 task slots become available earlier (indicated 
by the {{DefaultDeclarativeSlotPool}} with "Acquired new resources; new total 
acquired resources: ResourceCounter{resources={ResourceProfile{UNKNOWN}=4}}")
* The actual test failure is caused by the getHistory() returning an empty list 
even though the checkpoint snapshot returned a checkpoint count larger than 0 
earlier in the test. We checked that the count and the history are actually 
synchronously modified. So, the diff shouldn't happen.

I attached the debug logs of a locally failed test run to this Jira issue 
({{FLINK-36295.failure.with-revert.debug.log}}).

Command for extracting less noisy logs: 
{code}
$ cuts off the shutdown logs that appear after the test method and removes log 
messages of Loggers that are unrelated to this issue
head -926 FLINK-36295.failure.with-revert.debug.log | grep -vE 
'MetricRegistry|DispatcherRestEndpoint|MetricRegistryImpl' | less -S
{code}


was (Author: mapohl):
Several things were identified during the investigation so far:
* The test itself can have different behavior (with the same outcome) depending 
on whether the rescale is triggered by the StateTransitionManager or the 
failure due to missing TM slots. The test should be hardened accordingly to 
ensure consistent behavior (waiting for desired resources to be met to allow 
for the TM removal)
* The final wait in the test (see [line 
202|https://github.com/apache/flink/blob/db682b9869dcedce1873c6ee5252b0e351373a69/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerClusterITCase.java#L202])
 can cause inconsistent behavior if the job ran with sufficient resources (i.e. 
2 slots allocated) in the first place. In that case, the wait wouldn't delay 
the test code execution (this was actually the case for failed test run and 
might contribute to the behavior I summarize in the last item in this list)
* (/) [Reasonsing below] There is a strange observation in the logs where the 
StateTransitionManager is transitioning into Stabilizing phase with "Desired 
resources are not met" log message indicating that there are not enough 
resources available for immediate state transitioning. But the 4 task slots 
become available earlier (indicated by the {{DefaultDeclarativeSlotPool}} with 
"Acquired new resources; new total acquired resources: 
ResourceCounter{resources={ResourceProfile{UNKNOWN}=4}}")
* The actual test failure is caused by the getHistory() returning an empty list 
even though the checkpoint snapshot returned a checkpoint count larger than 0 
earlier in the test. We checked that the count and the history are actually 
synchronously modified. So, the diff shouldn't happen.

I attached the debug logs of a locally failed test run to this Jira issue 
({{FLINK-36295.failure.with-revert.debug.log}}).

Command for extracting less noisy logs: 
{code}
$ cuts off the shutdown logs that appear after the test method and removes log 
messages of Loggers that are unrelated to this issue
head -926 FLINK-36295.failure.with-revert.debug.log | grep -vE 
'MetricRegistry|DispatcherRestEndpoint|MetricRegistryImpl' | less -S
{code}

> AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale 
> failed with 
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-36295
>                 URL: https://issues.apache.org/jira/browse/FLINK-36295
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: test-stability
>         Attachments: 
> FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log,
>  FLINK-36295.failure.with-revert.debug.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale
>  -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30       at 
> java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30       at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to