[
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529041#comment-17529041
]
Ben Augarten commented on FLINK-27420:
--------------------------------------
Thanks for adding that context!
For 1.14 & 1.15, I looked through the implementation of
`ResourceManagerServiceImpl`, which seems to be new in 1.14, and I follow what
you're saying. I agree that the `resourceManagerMetricGroup` and
`slotManagerMetricGroup` are the only metrics affected and that storing the
`metricRegistry` (and `hostname`, which is required to create the metric group)
and creating the metric group with each new leader session makes the most
sense. I can start on this change and could post a PR tomorrow during US
working hours.
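To make sure we mean the same thing, here is a rough sketch of the idea. It does not use the real Flink classes: `ToyMetricGroup`, `ToySlotManager`, and `ToyMetricGroupFactory` are hypothetical stand-ins, and the toy group only imitates the behaviour described in the issue, where a closed group drops new registrations. The actual change would create the group from the stored `metricRegistry` and `hostname`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy stand-in for a metric group: once closed, new registrations are silently dropped. */
class ToyMetricGroup {
    private final String hostname; // only kept to mirror the scope information a real group needs
    private final Map<String, Supplier<Long>> gauges = new HashMap<>();
    private boolean closed;

    ToyMetricGroup(String hostname) {
        this.hostname = hostname;
    }

    void gauge(String name, Supplier<Long> gauge) {
        if (closed) {
            return; // assumption: mirrors the "still closed" check referenced in the issue
        }
        gauges.put(name, gauge);
    }

    void close() {
        closed = true;
        gauges.clear();
    }

    boolean isRegistered(String name) {
        return gauges.containsKey(name);
    }

    String getHostname() {
        return hostname;
    }
}

/** Hypothetical slot manager that receives its metric group on every start. */
class ToySlotManager {
    private ToyMetricGroup metricGroup;

    void start(ToyMetricGroup newMetricGroup) {
        metricGroup = newMetricGroup;
        metricGroup.gauge("taskSlotsAvailable", () -> 0L);
        metricGroup.gauge("taskSlotsTotal", () -> 0L);
    }

    void suspend() {
        metricGroup.close();
    }

    boolean hasSlotMetrics() {
        return metricGroup.isRegistered("taskSlotsAvailable")
                && metricGroup.isRegistered("taskSlotsTotal");
    }
}

/** Hypothetical factory: stores what is needed and builds a fresh group per leader session. */
class ToyMetricGroupFactory {
    private final String hostname;

    ToyMetricGroupFactory(String hostname) {
        this.hostname = hostname;
    }

    ToyMetricGroup createForNewLeaderSession() {
        return new ToyMetricGroup(hostname);
    }
}
```

Restarting a `ToySlotManager` with the same `ToyMetricGroup` after `suspend()` leaves `hasSlotMetrics()` false, which is the symptom from the issue. Passing `createForNewLeaderSession()` on each start brings the gauges back.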
I'm not sure I follow your point about 1.13 though. I currently run 1.13 in
standalone, session mode and the JM process does seem to live through multiple
leader sessions. Regardless, my understanding is that Flink only supports the
latest two versions, which would be 1.14 and 1.15, so a patch for 1.13 is not
desired – is that correct?
> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.13.5
> Reporter: Ben Augarten
> Priority: Major
>
> The symptom is that SlotManager metrics are missing (taskSlotsAvailable and
> taskSlotsTotal) when a SlotManager is suspended and then restarted. We
> noticed this issue when running 1.13.5, but I believe this impacts 1.14.x,
> 1.15.x, and master.
>
> When a SlotManager is suspended, the [metrics group is
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
> When the SlotManager is [started
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
> it makes an attempt to [reregister
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
> but that fails because the underlying metrics group [is still
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
>
>
> I was able to trace through this issue by restarting zookeeper nodes in a
> staging environment and watching the JM with a debugger.
>
> A concise test, which currently fails, shows the expected behavior –
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>
> I am happy to provide a PR to fix this issue, but first would like to verify
> that this is not intended.