[ https://issues.apache.org/jira/browse/FLINK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527486#comment-17527486 ]

Matthias Pohl edited comment on FLINK-27354 at 4/25/22 1:33 PM:
----------------------------------------------------------------

The retry mechanism is scheduled using the {{rpcService}} of the {{JobMaster}} 
(see 
[JobMaster:1291|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1291]).
 The behavior is as described in the issue description: the {{JobMaster}} is 
deregistered from the {{ResourceManager}}, the RM informs the {{JobMaster}} about 
the disconnect, and the {{JobMaster}} then tries to reconnect to the 
{{ResourceManager}}. The {{StandaloneResourceManager}} processes these RPC calls 
by returning an "{{RpcConnectionException: Could not connect to rpc 
endpoint under address}}" error after some time, resulting in the repeated 
"{{Registering job manager [...] failed}}" log messages.

Internally, a {{RetryingRegistration}} is used in the {{ResourceManagerConnection}} 
(see 
[JobMaster:1285|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1285]).
 The initial invocation is triggered with a quite small timeout of 100ms (derived 
from 
[cluster.registration.initial-timeout|https://github.com/apache/flink/blob/e921c4c34b5497f4ba723ddae58750f6778069fa/flink-core/src/main/java/org/apache/flink/configuration/ClusterOptions.java#L41]).
 This fails, and we end up in the error-handling path (see 
[RetryingRegistration:281|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L281]),
 in which the timeout grows exponentially with each failed attempt (see 
[RetryingRegistration:297|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L297]).
 This can be observed in the logs as well and explains the multiple log messages 
at increasing intervals.

The retry mechanism has no boundary and will go on forever.
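A minimal sketch of this behavior (hypothetical names, not Flink's actual {{RetryingRegistration}}; it assumes an initial per-attempt timeout of 100ms that doubles on each failure up to an illustrative cap, while the retry loop itself never terminates):

```java
// Hypothetical model of the unbounded exponential-backoff retry described
// above. Class and constant names are illustrative, not Flink's.
public class UnboundedRetrySketch {
    static final long INITIAL_TIMEOUT_MS = 100;  // cf. cluster.registration.initial-timeout
    static final long MAX_TIMEOUT_MS = 30_000;   // assumed cap on a single attempt's timeout

    // Returns the per-attempt timeouts for the first `attempts` registration tries.
    static long[] timeouts(int attempts) {
        long[] result = new long[attempts];
        long timeout = INITIAL_TIMEOUT_MS;
        for (int i = 0; i < attempts; i++) {
            result[i] = timeout;
            // Each failed attempt doubles the timeout, capped at MAX_TIMEOUT_MS.
            // Nothing here limits the number of attempts, so the loop that
            // drives these retries would run forever.
            timeout = Math.min(timeout * 2, MAX_TIMEOUT_MS);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints 100, 200, 400, 800, 1600: the increasing spacing between
        // repeated "Registering job manager [...] failed" log lines.
        for (long t : timeouts(5)) {
            System.out.println(t);
        }
    }
}
```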



> JobMaster still processes requests while terminating
> ----------------------------------------------------
>
>                 Key: FLINK-27354
>                 URL: https://issues.apache.org/jira/browse/FLINK-27354
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.6, 1.14.4
>            Reporter: Matthias Pohl
>            Priority: Major
>         Attachments: flink-logs.zip
>
>
> An issue was reported in the [user 
> ML|https://lists.apache.org/thread/5pm3crntmb1hl17h4txnlhjz34clghrg] about 
> the JobMaster trying to reconnect to the ResourceManager during shutdown.
> The JobMaster is disconnecting from the ResourceManager during shutdown (see 
> [JobMaster:1182|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1182]).
>  This triggers the deregistration of the job in the {{ResourceManager}}. The 
> RM responds asynchronously at the end of this deregistration through 
> {{disconnectResourceManager}} (see 
> [ResourceManager:993|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L993])
>  which will trigger a reconnect on the {{JobMaster}}'s side (see 
> [JobMaster::disconnectResourceManager|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L789])
>  if it's still around because the {{resourceManagerAddress}} (used in 
> {{isConnectingToResourceManager}}) is not cleared. This would only happen 
> during an RM leader change.
> The {{disconnectResourceManager}} will be ignored if the {{JobMaster}} is 
> gone already.
> We should add a guard in some way to {{JobMaster}} to avoid reconnecting to 
> other components during shutdown. This might not only include the 
> ResourceManager connection but might also affect other parts of the 
> {{JobMaster}} API.
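
A guard of the sort suggested above could be sketched as follows (hypothetical, not Flink's actual API; it assumes a shutdown flag that is set when termination begins and checked before any reconnect is initiated):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative shutdown guard for reconnection attempts; names are made up.
public class GuardedReconnect {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    // Called when the component begins terminating.
    public void startShutdown() {
        shuttingDown.set(true);
    }

    // Runs the reconnect action only if the component is still live.
    // Returns true if a reconnect was actually initiated.
    public boolean tryReconnect(Runnable connectAction) {
        if (shuttingDown.get()) {
            // During shutdown, ignore the disconnect notification instead of
            // re-registering with the ResourceManager.
            return false;
        }
        connectAction.run();
        return true;
    }

    public static void main(String[] args) {
        GuardedReconnect jm = new GuardedReconnect();
        System.out.println(jm.tryReconnect(() -> System.out.println("reconnecting")));
        jm.startShutdown();
        System.out.println(jm.tryReconnect(() -> System.out.println("reconnecting")));
    }
}
```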



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
