[
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528495#comment-17528495
]
Xintong Song commented on FLINK-27420:
--------------------------------------
Thanks for reporting this, [~baugarten].
This is indeed a valid issue. I'd like to add a bit more clarification.
* For 1.13, I don't think it is supported for the JM process to live through
multiple leader sessions, i.e. being revoked and re-granted leadership without
failing the process. I know some of the code looks like it is supported, but
unfortunately it never really worked until FLINK-23240, which was fixed in 1.14.4.
* For 1.14 & 1.15, yes, the issue still exists. Since 1.14, for each leader
session we create a new ResourceManager instance. However, some of the
components and services are preserved in {{ResourceManagerProcessContext}} and
are reused across multiple RM instances. If these components / services are
closed, they need to be restarted properly. I've checked the current
implementation, and it seems the only things that are affected are
{{resourceManagerMetricGroup}} and {{slotManagerMetricGroup}}. I think the
easiest way to fix this is probably to store {{metricRegistry}} rather than the
metric groups in {{ResourceManagerProcessContext}}, so that we can create new
metric group instances for each leader session.
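To make the proposal concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not Flink code: {{ToyRegistry}}, {{ToyMetricGroup}} and {{runSession}} are hypothetical stand-ins, and the only behavior assumed from the issue is that a closed metric group silently drops later registrations.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Supplier;

public class LeaderSessionMetricsSketch {

    /** Hypothetical stand-in for a metric registry; it outlives every leader session. */
    static final class ToyRegistry {
        final Map<String, Supplier<Integer>> gauges = new TreeMap<>();
    }

    /** Hypothetical stand-in for a metric group: once closed, it stays closed. */
    static final class ToyMetricGroup {
        private final ToyRegistry registry;
        private final Set<String> ownNames = new HashSet<>();
        private boolean closed;

        ToyMetricGroup(ToyRegistry registry) {
            this.registry = registry;
        }

        void gauge(String name, Supplier<Integer> value) {
            if (closed) {
                return; // registration on a closed group is dropped silently
            }
            registry.gauges.put(name, value);
            ownNames.add(name);
        }

        void close() {
            closed = true;
            // Unregister only the metrics this group added.
            registry.gauges.keySet().removeAll(ownNames);
            ownNames.clear();
        }
    }

    /** One leader session: register the two slot gauges, optionally suspend afterwards. */
    static void runSession(ToyMetricGroup group, boolean suspendAfterwards) {
        group.gauge("taskSlotsAvailable", () -> 0);
        group.gauge("taskSlotsTotal", () -> 0);
        if (suspendAfterwards) {
            group.close();
        }
    }

    /** Current behavior: the context keeps the group, so session 2 reuses a closed one. */
    static int gaugesAfterRestartWithSharedGroup() {
        ToyRegistry registry = new ToyRegistry();
        ToyMetricGroup shared = new ToyMetricGroup(registry);
        runSession(shared, true);  // first session, then leadership is revoked
        runSession(shared, false); // leadership is re-granted
        return registry.gauges.size();
    }

    /** Proposed behavior: the context keeps the registry; each session gets a new group. */
    static int gaugesAfterRestartWithFreshGroups() {
        ToyRegistry registry = new ToyRegistry();
        runSession(new ToyMetricGroup(registry), true);
        runSession(new ToyMetricGroup(registry), false);
        return registry.gauges.size();
    }

    public static void main(String[] args) {
        System.out.println("shared group: " + gaugesAfterRestartWithSharedGroup());
        System.out.println("fresh groups: " + gaugesAfterRestartWithFreshGroups());
    }
}
```

In this toy model the shared-group path ends with no gauges registered after the second session, while the fresh-group path ends with both. The real change would of course need to go through the actual metric group factories rather than these stand-ins.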
WDYT?
> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.13.5
> Reporter: Ben Augarten
> Priority: Major
>
> The symptom is that SlotManager metrics are missing (taskslotsavailable and
> taskslotstotal) when a SlotManager is suspended and then restarted. We
> noticed this issue when running 1.13.5, but I believe this impacts 1.14.x,
> 1.15.x, and master.
>
> When a SlotManager is suspended, the [metrics group is
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
> When the SlotManager is [started
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
> it makes an attempt to [reregister
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
> but that fails because the underlying metrics group [is still
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393]
>
>
> I was able to trace through this issue by restarting zookeeper nodes in a
> staging environment and watching the JM with a debugger.
>
> A concise test, which currently fails, shows the expected behavior –
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>
> I am happy to provide a PR to fix this issue, but first would like to verify
> that this is not intended.