spuru9 commented on code in PR #1131:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/1131#discussion_r3355815656


##########
flink-autoscaler/src/test/java/org/apache/flink/autoscaler/RestApiMetricsCollectorTest.java:
##########
@@ -164,32 +164,29 @@ void testJmMetricCollection() throws Exception {
                                     new StandaloneClientHAServices(
                                             
miniCluster.getRestAddress().get().toString()));
             var collector = new RestApiMetricsCollector<>();
-            Map<FlinkMetric, Metric> flinkMetricMetricMap = new HashMap<>();
-            // Metrics might not be available yet so retry the query until it 
returns results or the
-            // timeout reached.
+            // Metrics might not be available yet, and the JobManager exposes 
the slot metrics
+            // before the TaskManager has registered its slots (reporting 0). 
Retry the assertion
+            // until the slots are registered, or the timeout is reached.
             await().atMost(Duration.ofSeconds(60))
-                    .until(
+                    .untilAsserted(
                             () -> {
-                                final Map<FlinkMetric, Metric> results =
+                                final Map<FlinkMetric, Metric> 
flinkMetricMetricMap =
                                         collector.queryJmMetrics(
                                                 client,
                                                 Map.of(
                                                         "taskSlotsTotal",
                                                         
FlinkMetric.NUM_TASK_SLOTS_TOTAL,
                                                         "taskSlotsAvailable",
                                                         
FlinkMetric.NUM_TASK_SLOTS_AVAILABLE));
-                                flinkMetricMetricMap.putAll(results);
-                                return !results.isEmpty();
+                                assertThat(flinkMetricMetricMap)

Review Comment:
   The 1.20.4 update leads to a failure in this test
   As per analysis (by CC)
   ```
     FLINK-38703 — "ConcurrentModificationException in 
FineGrainedResourceManager during reporting metrics" — is the matching fix in 
1.20.4 (it regressed in 1.20.3). Looking at the 1.20 backport (PR #27393), it
     changed the slot-count gauges (taskSlotsAvailable/taskSlotsTotal):
     - Before: the gauge computed the value live (getNumberFreeSlots()), which 
under the bug threw a ConcurrentModificationException while the cluster was 
still stabilizing — so the metric failed to serialize and
     effectively wasn't returned yet. That's exactly why the old "wait until 
non-empty" worked: it kept retrying through the unstable window.
     - After: the gauge reads a cached field 
(lastNumberFreeSlots/lastNumberRegisteredSlots) refreshed every second on the 
main thread, initialized to 0. No exception → the metric is now returned as 0 
straight
     away → the old wait grabs the early 0.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to