[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531222#comment-17531222
 ] 

Nicolaus Weidner commented on FLINK-27420:
------------------------------------------

This looks fine on master, but a test was backported without checking 
that the variables it uses exist on the release branches. For example, on 
release-1.15, delegationTokenManager is undefined: 
https://github.com/apache/flink/blob/e0c82d6d52871dbbea70c9b41384d2d33179bec0/flink-runtime/src/test/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImplTest.java#L520.
 It was added in this commit: 
https://github.com/apache/flink/commit/26aa543b3bbe2b606bbc6d332a2ef7c5b46d25eb

I didn't check for the specific issue on release-1.14.

> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
>                 Key: FLINK-27420
>                 URL: https://issues.apache.org/jira/browse/FLINK-27420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.13.5
>            Reporter: Ben Augarten
>            Assignee: Ben Augarten
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0, 1.14.5, 1.15.1
>
>
> The symptom is that the SlotManager metrics taskslotsavailable and 
> taskslotstotal are missing after a SlotManager is suspended and then 
> restarted. We noticed this issue when running 1.13.5, but I believe it also 
> affects 1.14.x, 1.15.x, and master.
>  
> When a SlotManager is suspended, the [metrics group is 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
>  When the SlotManager is [started 
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
>  it makes an attempt to [reregister 
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
>  but that fails because the underlying metrics group [is still 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
>  
>  
> I was able to trace through this issue by restarting ZooKeeper nodes in a 
> staging environment and watching the JM with a debugger. 
>  
> A concise test, which currently fails, shows the expected behavior – 
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>  
> I am happy to provide a PR to fix this issue, but first would like to verify 
> that this is not intended.
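
The failure mode described in the quoted report can be illustrated with a small, self-contained sketch. It does not use Flink's classes: SimpleMetricGroup and SimpleSlotManager are hypothetical stand-ins, and the sketch assumes that a closed group silently drops new registrations, which is what the report describes for the real metrics group.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ClosedMetricGroupSketch {

    /** Hypothetical stand-in for a metric group; not Flink's AbstractMetricGroup. */
    static class SimpleMetricGroup {
        private final Map<String, Supplier<Integer>> gauges = new HashMap<>();
        private boolean closed;

        void gauge(String name, Supplier<Integer> gauge) {
            // Assumption: once closed, the group ignores further registrations.
            if (!closed) {
                gauges.put(name, gauge);
            }
        }

        void close() {
            // Closing unregisters everything and is never undone.
            closed = true;
            gauges.clear();
        }

        boolean isRegistered(String name) {
            return gauges.containsKey(name);
        }
    }

    /** Hypothetical stand-in for a slot manager that reuses one group across restarts. */
    static class SimpleSlotManager {
        private final SimpleMetricGroup metricGroup;

        SimpleSlotManager(SimpleMetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }

        void start() {
            // Metric names are spelled as in the report.
            metricGroup.gauge("taskslotsavailable", () -> 0);
            metricGroup.gauge("taskslotstotal", () -> 0);
        }

        void suspend() {
            metricGroup.close();
        }
    }

    /** Runs start, suspend, start and reports whether the gauge survived the restart. */
    static boolean metricsPresentAfterRestart() {
        SimpleMetricGroup group = new SimpleMetricGroup();
        SimpleSlotManager slotManager = new SimpleSlotManager(group);
        slotManager.start();
        slotManager.suspend();
        slotManager.start(); // re-registration hits the closed group
        return group.isRegistered("taskslotsavailable");
    }

    public static void main(String[] args) {
        System.out.println("metrics present after restart: " + metricsPresentAfterRestart());
    }
}
```

In this sketch the second start() has no effect because the group stays closed, so the gauge is reported as missing after the restart. A fix along these lines would need either a fresh group on each start or a suspend that unregisters metrics without closing the group; which approach the actual patch takes is not stated here.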



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
