[ 
https://issues.apache.org/jira/browse/FLINK-36295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882674#comment-17882674
 ] 

Matthias Pohl edited comment on FLINK-36295 at 9/18/24 11:49 AM:
-----------------------------------------------------------------

Several things were identified during the investigation so far:
 * The test itself can have different behavior (with the same outcome) 
depending on whether the rescale is triggered by the StateTransitionManager or 
the failure due to missing TM slots. The test should be hardened accordingly to 
ensure consistent behavior (waiting for desired resources to be met to allow 
for the TM removal)
 * The final wait in the test (see [line 
202|https://github.com/apache/flink/blob/db682b9869dcedce1873c6ee5252b0e351373a69/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerClusterITCase.java#L202])
 can cause inconsistent behavior if the job ran with sufficient resources (i.e. 
2 slots allocated) in the first place. In that case, the wait wouldn't delay 
the test code execution (this was actually the case for failed test run and 
might contribute to the behavior I summarize in the last item in this list)
 * (/) [ [Reasonsing 
below|https://issues.apache.org/jira/browse/FLINK-36295?focusedCommentId=17882688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17882688]
 ] There is a strange observation in the logs where the StateTransitionManager 
is transitioning into Stabilizing phase with "Desired resources are not met" 
log message indicating that there are not enough resources available for 
immediate state transitioning. But the 4 task slots become available earlier 
(indicated by the {{DefaultDeclarativeSlotPool}} with "Acquired new resources; 
new total acquired resources: ResourceCounter\{resources={ResourceProfile 
{UNKNOWN}
=4}}")

 * The actual test failure is caused by the getHistory() returning an empty 
list even though the checkpoint snapshot returned a checkpoint count larger 
than 0 earlier in the test. We checked that the count and the history are 
actually synchronously modified. So, the diff shouldn't happen.

I attached the debug logs of a locally failed test run to this Jira issue 
({{{}FLINK-36295.failure.with-revert.debug.log{}}}).

Command for extracting less noisy logs:
{code:java}
$ cuts off the shutdown logs that appear after the test method and removes log 
messages of Loggers that are unrelated to this issue
head -926 FLINK-36295.failure.with-revert.debug.log | grep -vE 
'MetricRegistry|DispatcherRestEndpoint|MetricRegistryImpl' | less -S
{code}


was (Author: mapohl):
Several things were identified during the investigation so far:
 * The test itself can have different behavior (with the same outcome) 
depending on whether the rescale is triggered by the StateTransitionManager or 
the failure due to missing TM slots. The test should be hardened accordingly to 
ensure consistent behavior (waiting for desired resources to be met to allow 
for the TM removal)
 * The final wait in the test (see [line 
202|https://github.com/apache/flink/blob/db682b9869dcedce1873c6ee5252b0e351373a69/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerClusterITCase.java#L202])
 can cause inconsistent behavior if the job ran with sufficient resources (i.e. 
2 slots allocated) in the first place. In that case, the wait wouldn't delay 
the test code execution (this was actually the case for failed test run and 
might contribute to the behavior I summarize in the last item in this list)
 * (/) [[Reasonsing 
below|https://issues.apache.org/jira/browse/FLINK-36295?focusedCommentId=17882688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17882688]]
 There is a strange observation in the logs where the StateTransitionManager is 
transitioning into Stabilizing phase with "Desired resources are not met" log 
message indicating that there are not enough resources available for immediate 
state transitioning. But the 4 task slots become available earlier (indicated 
by the {{DefaultDeclarativeSlotPool}} with "Acquired new resources; new total 
acquired resources: ResourceCounter\{resources={ResourceProfile {UNKNOWN}
=4}}")

 * The actual test failure is caused by the getHistory() returning an empty 
list even though the checkpoint snapshot returned a checkpoint count larger 
than 0 earlier in the test. We checked that the count and the history are 
actually synchronously modified. So, the diff shouldn't happen.

I attached the debug logs of a locally failed test run to this Jira issue 
({{{}FLINK-36295.failure.with-revert.debug.log{}}}).

Command for extracting less noisy logs:
{code:java}
$ cuts off the shutdown logs that appear after the test method and removes log 
messages of Loggers that are unrelated to this issue
head -926 FLINK-36295.failure.with-revert.debug.log | grep -vE 
'MetricRegistry|DispatcherRestEndpoint|MetricRegistryImpl' | less -S
{code}

> AdaptiveSchedulerClusterITCase. testCheckpointStatsPersistedAcrossRescale 
> failed with 
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-36295
>                 URL: https://issues.apache.org/jira/browse/FLINK-36295
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: test-stability
>         Attachments: 
> FLINK-36295.failure.62156.20240916.1.logs-cron_jdk17-test_cron_jdk17_core-1726454552.log,
>  FLINK-36295.failure.with-revert.debug.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=10234
> {code}
> Sep 16 03:06:30 03:06:30.168 [ERROR] Tests run: 3, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 5.275 s <<< FAILURE! -- in 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
> Sep 16 03:06:30 03:06:30.168 [ERROR] 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale
>  -- Time elapsed: 0.676 s <<< ERROR!
> Sep 16 03:06:30 java.lang.IndexOutOfBoundsException: Index: -1
> Sep 16 03:06:30       at 
> java.base/java.util.Collections$EmptyList.get(Collections.java:4586)
> Sep 16 03:06:30       at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:214)
> Sep 16 03:06:30       at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
> Sep 16 03:06:30       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> Sep 16 03:06:30
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to