spuru9 commented on code in PR #1131:
URL:
https://github.com/apache/flink-kubernetes-operator/pull/1131#discussion_r3355815656
##########
flink-autoscaler/src/test/java/org/apache/flink/autoscaler/RestApiMetricsCollectorTest.java:
##########
@@ -164,32 +164,29 @@ void testJmMetricCollection() throws Exception {
new StandaloneClientHAServices(
miniCluster.getRestAddress().get().toString()));
var collector = new RestApiMetricsCollector<>();
- Map<FlinkMetric, Metric> flinkMetricMetricMap = new HashMap<>();
- // Metrics might not be available yet so retry the query until it
returns results or the
- // timeout reached.
+ // Metrics might not be available yet, and the JobManager exposes
the slot metrics
+ // before the TaskManager has registered its slots (reporting 0).
Retry the assertion
+ // until the slots are registered, or the timeout is reached.
await().atMost(Duration.ofSeconds(60))
- .until(
+ .untilAsserted(
() -> {
- final Map<FlinkMetric, Metric> results =
+ final Map<FlinkMetric, Metric>
flinkMetricMetricMap =
collector.queryJmMetrics(
client,
Map.of(
"taskSlotsTotal",
FlinkMetric.NUM_TASK_SLOTS_TOTAL,
"taskSlotsAvailable",
FlinkMetric.NUM_TASK_SLOTS_AVAILABLE));
- flinkMetricMetricMap.putAll(results);
- return !results.isEmpty();
+ assertThat(flinkMetricMetricMap)
Review Comment:
The 1.20.4 update leads to a failure in this test
As per analysis (by CC)
```
FLINK-38703 — "ConcurrentModificationException in
FineGrainedResourceManager during reporting metrics" — is the matching fix in
1.20.4 (it regressed in 1.20.3). Looking at the 1.20 backport (PR #27393), it
changed the slot-count gauges (taskSlotsAvailable/taskSlotsTotal):
- Before: the gauge computed the value live (getNumberFreeSlots()), which
under the bug threw a ConcurrentModificationException while the cluster was
still stabilizing — so the metric failed to serialize and
effectively wasn't returned yet. That's exactly why the old "wait until
non-empty" worked: it kept retrying through the unstable window.
- After: the gauge reads a cached field
(lastNumberFreeSlots/lastNumberRegisteredSlots) refreshed every second on the
main thread, initialized to 0. No exception → the metric is now returned as 0
straight
away → the old wait grabs the early 0.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]