[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529153#comment-17529153
 ] 

Xintong Song commented on FLINK-27420:
--------------------------------------

[~baugarten],

Thanks for volunteering to fix this. I've assigned you to this ticket.

Regarding 1.13, I'm surprised that your JM process lives through 
multiple leader sessions. IIRC, we tried this before making the changes in 
1.14, and the JM process was terminated after losing leadership. Unfortunately, 
I cannot recall exactly how it was terminated. Anyway, if that works 
for you, I'd be fine with also fixing this for 1.13.

According to the [Update Policy for old 
releases|https://flink.apache.org/downloads.html#update-policy-for-old-releases],
 the Flink community provides support for the latest 2 versions, but is also 
open to discussing bugfix releases for older versions. In fact, it is not rare 
for us to ship a bugfix release for the 3rd latest version. To sum up, although I 
don't know whether (or when) there will be another bugfix release for 1.13.x, I 
would not consider a patch for 1.13 unwanted.

> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
>                 Key: FLINK-27420
>                 URL: https://issues.apache.org/jira/browse/FLINK-27420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.13.5
>            Reporter: Ben Augarten
>            Assignee: Ben Augarten
>            Priority: Major
>
> The symptom is that the SlotManager metrics taskslotsavailable and 
> taskslotstotal are missing after a SlotManager is suspended and then 
> restarted. We noticed this issue when running 1.13.5, but I believe it also 
> affects 1.14.x, 1.15.x, and master.
>  
> When a SlotManager is suspended, the [metrics group is 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
>  When the SlotManager is [started 
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
>  it makes an attempt to [reregister 
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
>  but that fails because the underlying metrics group [is still 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
>  
>  
> I was able to trace through this issue by restarting zookeeper nodes in a 
> staging environment and watching the JM with a debugger. 
>  
> A concise test, which currently fails, shows the expected behavior – 
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>  
> I am happy to provide a PR to fix this issue, but first would like to verify 
> that this is not intended.
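
The failure mode described in the issue can be sketched in isolation. The classes below are simplified, hypothetical stand-ins and not Flink's actual DeclarativeSlotManager or AbstractMetricGroup. They only assume what the description states: suspending closes the metric group, the group never reopens, and a closed group drops new registrations without raising an error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class SuspendedMetricsSketch {

    /** Hypothetical stand-in for a metric group: once closed, it stays closed. */
    static class SketchMetricGroup {
        private final Map<String, Supplier<Integer>> gauges = new HashMap<>();
        private boolean closed;

        void gauge(String name, Supplier<Integer> gauge) {
            // Assumption: a closed group silently ignores new registrations.
            if (!closed) {
                gauges.put(name, gauge);
            }
        }

        void close() {
            closed = true;
            gauges.clear();
        }

        boolean isRegistered(String name) {
            return gauges.containsKey(name);
        }
    }

    /** Hypothetical stand-in for a slot manager that reuses one metric group. */
    static class SketchSlotManager {
        private final SketchMetricGroup metricGroup;

        SketchSlotManager(SketchMetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }

        void start() {
            // Registers the two metrics named in the issue on every start.
            metricGroup.gauge("taskslotsavailable", () -> 0);
            metricGroup.gauge("taskslotstotal", () -> 0);
        }

        void suspend() {
            // Closing the group here is what makes the next start() a no-op.
            metricGroup.close();
        }
    }

    public static void main(String[] args) {
        SketchMetricGroup group = new SketchMetricGroup();
        SketchSlotManager slotManager = new SketchSlotManager(group);

        slotManager.start();
        System.out.println("registered after first start: "
                + group.isRegistered("taskslotstotal"));

        slotManager.suspend();
        slotManager.start();
        System.out.println("registered after restart: "
                + group.isRegistered("taskslotstotal"));
    }
}
```

In this sketch the second start() calls the same registration code as the first, yet the gauges stay absent because the shared group was closed during suspend(). That matches the reported symptom of the metrics missing after a leadership change.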



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
