Ben Augarten created FLINK-27420:
------------------------------------

             Summary: Suspended SlotManagers fail to reregister metrics when started again
                 Key: FLINK-27420
                 URL: https://issues.apache.org/jira/browse/FLINK-27420
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Metrics
    Affects Versions: 1.13.5
            Reporter: Ben Augarten


The symptom is that the SlotManager metrics taskSlotsAvailable and 
taskSlotsTotal are missing after a SlotManager is suspended and then restarted. 
We noticed this issue when running 1.13.5, but I believe it also affects 
1.14.x, 1.15.x, and master.

When a SlotManager is suspended, the [metrics group is 
closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
When the SlotManager is [started 
again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
it attempts to [reregister 
metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
but that fails because the underlying metrics group [is still 
closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
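
The sequence described above can be reproduced with a minimal, self-contained sketch. It does not use Flink's classes: SketchMetricGroup, SketchSlotManager, and SuspendedSlotManagerSketch are hypothetical stand-ins, and the only behavior assumed is what this report states, namely that a closed metrics group ignores new registrations and that the SlotManager reuses the same group across restarts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a metrics group; not Flink's AbstractMetricGroup. */
class SketchMetricGroup {
    private final Map<String, Supplier<Long>> gauges = new HashMap<>();
    private boolean closed;

    /** Registers a gauge unless the group has already been closed. */
    void gauge(String name, Supplier<Long> gauge) {
        if (closed) {
            // Assumption taken from the report: a closed group drops registrations silently.
            return;
        }
        gauges.put(name, gauge);
    }

    /** Closes the group for good and unregisters everything. */
    void close() {
        closed = true;
        gauges.clear();
    }

    boolean isRegistered(String name) {
        return gauges.containsKey(name);
    }
}

/** Hypothetical stand-in for a slot manager that keeps one metrics group for its lifetime. */
class SketchSlotManager {
    private final SketchMetricGroup metricGroup;

    SketchSlotManager(SketchMetricGroup metricGroup) {
        this.metricGroup = metricGroup;
    }

    /** Registers the slot metrics on every start, including restarts. */
    void start() {
        metricGroup.gauge("taskSlotsAvailable", () -> 0L);
        metricGroup.gauge("taskSlotsTotal", () -> 0L);
    }

    /** Closes the metrics group; nothing reopens or replaces it afterwards. */
    void suspend() {
        metricGroup.close();
    }
}

class SuspendedSlotManagerSketch {
    /** Returns whether taskSlotsTotal is registered after the first start and after a restart. */
    static boolean[] run() {
        SketchMetricGroup group = new SketchMetricGroup();
        SketchSlotManager slotManager = new SketchSlotManager(group);

        slotManager.start();
        boolean afterFirstStart = group.isRegistered("taskSlotsTotal");

        slotManager.suspend();
        slotManager.start(); // re-registration attempt against the closed group
        boolean afterRestart = group.isRegistered("taskSlotsTotal");

        return new boolean[] {afterFirstStart, afterRestart};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("registered after first start: " + result[0]);
        System.out.println("registered after restart: " + result[1]);
    }
}
```

In this sketch the metric is present after the first start and absent after the restart, which matches the symptom. Whether the real fix should hand the SlotManager a fresh metrics group on each start or avoid closing the group on suspend is left open here.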

I was able to trace this issue by restarting ZooKeeper nodes in a 
staging environment and watching the JobManager (JM) with a debugger.

A concise test, which currently fails, shows the expected behavior: 
[https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]

I am happy to provide a PR to fix this issue, but first would like to verify 
that this is not intended.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
