[
https://issues.apache.org/jira/browse/FLINK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527486#comment-17527486
]
Matthias Pohl edited comment on FLINK-27354 at 4/25/22 1:40 PM:
----------------------------------------------------------------
The retry mechanism is scheduled using the {{rpcService}} of the {{JobMaster}}
(see
[JobMaster:1291|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1291]).
The behavior is as described in the issue description: The {{JobMaster}} is
deregistered from the {{ResourceManager}}. The RM informs the {{JobMaster}} about
the disconnect. The {{JobMaster}} then tries to reconnect to the
{{ResourceManager}}. The {{StandaloneResourceManager}} still processes these RPC
calls but, after some time, answers with a "{{RpcConnectionException: Could not
connect to rpc endpoint under address}}" error, which results in the repeated
"{{Registering job manager [...] failed}}" log message.
Internally, a {{RetryingRegistration}} is used in the {{ResourceManagerConnection}}
(see
[JobMaster:1285|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1285]).
The initial registration attempt is triggered with a rather small timeout of 100ms
(derived from
[cluster.registration.initial-timeout|https://github.com/apache/flink/blob/e921c4c34b5497f4ba723ddae58750f6778069fa/flink-core/src/main/java/org/apache/flink/configuration/ClusterOptions.java#L41]).
This attempt fails and we end up in the timeout-based error handling (see
[RetryingRegistration:281|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L281]),
where the timeout grows exponentially with every failed attempt (see
[RetryingRegistration:297|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L297]).
This can be observed in the logs as well and explains the multiple log messages.
The timeouts happen because, on the {{ResourceManager}}'s side, we still try to
connect to the {{RpcEndpoint}} of the {{JobMaster}}, which is already in the
process of stopping.
The retry mechanism is not bounded and will keep retrying forever.
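A minimal, self-contained sketch of this retry pattern, assuming only what is described above (an initial timeout of 100ms that grows exponentially on every failed attempt, with no bound on the number of attempts); the class, the constants, and the always-failing {{register}} stub are invented for illustration and are not the actual {{RetryingRegistration}} code:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Illustrative sketch only: mimics the exponential backoff described above,
// it is NOT the real RetryingRegistration implementation.
public class RegistrationBackoffSketch {

    private static final long INITIAL_TIMEOUT_MS = 100;  // cluster.registration.initial-timeout default
    private static final long MAX_TIMEOUT_MS = 30_000;   // assumed cap, just for the sketch

    /** Stand-in for the registration RPC; always fails, like the stopping JobMaster endpoint. */
    private static CompletableFuture<Void> register(long timeoutMs) {
        return CompletableFuture.failedFuture(
                new RuntimeException("Could not connect to rpc endpoint under address ..."));
    }

    public static void main(String[] args) throws InterruptedException {
        long timeoutMs = INITIAL_TIMEOUT_MS;
        // The real retry loop has no upper bound on attempts; we stop after a few
        // rounds here so the sketch terminates.
        for (int attempt = 1; attempt <= 8; attempt++) {
            try {
                register(timeoutMs).join();
                return; // registration succeeded
            } catch (CompletionException e) {
                timeoutMs = Math.min(timeoutMs * 2, MAX_TIMEOUT_MS); // exponential growth
                System.out.printf(
                        "Registration attempt %d failed (%s); retrying with timeout %d ms%n",
                        attempt, e.getCause().getMessage(), timeoutMs);
                Thread.sleep(timeoutMs); // back off before the next attempt
            }
        }
    }
}
{code}
Each failed attempt with a doubled timeout corresponds to another "{{Registering job manager [...] failed}}" line, which matches the repeated log messages mentioned above.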
> JobMaster still processes requests while terminating
> ----------------------------------------------------
>
> Key: FLINK-27354
> URL: https://issues.apache.org/jira/browse/FLINK-27354
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.13.6, 1.14.4
> Reporter: Matthias Pohl
> Priority: Major
> Attachments: flink-logs.zip
>
>
> An issue was reported in the [user
> ML|https://lists.apache.org/thread/5pm3crntmb1hl17h4txnlhjz34clghrg] about
> the JobMaster trying to reconnect to the ResourceManager during shutdown.
> The JobMaster is disconnecting from the ResourceManager during shutdown (see
> [JobMaster:1182|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1182]).
> This triggers the deregistration of the job in the {{ResourceManager}}. The
> RM responds asynchronously at the end of this deregistration through
> {{disconnectResourceManager}} (see
> [ResourceManager:993|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L993])
> which will trigger a reconnect on the {{JobMaster}}'s side (see
> [JobMaster::disconnectResourceManager|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L789])
> if it's still around because the {{resourceManagerAddress}} (used in
> {{isConnectingToResourceManager}}) is not cleared. This would only happen
> during an RM leader change.
> The {{disconnectResourceManager}} call will be ignored if the {{JobMaster}} is
> already gone.
> We should add a guard in some way to {{JobMaster}} to avoid reconnecting to
> other components during shutdown. This might not only include the
> ResourceManager connection but might also affect other parts of the
> {{JobMaster}} API.
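To make the last point concrete, here is a hedged sketch of the kind of guard described above; the field and method names are invented for illustration and this is not an actual patch against {{JobMaster}}:
{code:java}
// Hypothetical illustration of a shutdown guard: ignore reconnect requests once
// termination has started. Names are made up; this is not the Flink JobMaster.
public class ShutdownGuardSketch {

    private volatile boolean terminating = false;
    private volatile String resourceManagerAddress = "rm-address-placeholder";

    /** Entered when the JobMaster begins shutting down. */
    public void startTermination() {
        terminating = true;
        // Clearing the address as well would keep a stale address from being
        // treated as an ongoing connection attempt.
        resourceManagerAddress = null;
    }

    /** Simplified stand-in for the disconnect callback. */
    public void disconnectResourceManager(Exception cause) {
        if (terminating) {
            // Guard: never reconnect to other components during shutdown.
            return;
        }
        if (resourceManagerAddress != null) {
            reconnectToResourceManager(cause);
        }
    }

    private void reconnectToResourceManager(Exception cause) {
        // Would start a new registration with the ResourceManager in the real code.
    }
}
{code}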