kerneltime commented on code in PR #10470:
URL: https://github.com/apache/ozone/pull/10470#discussion_r3382968892
##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/HeartbeatEndpointTask.java:
##########
@@ -157,12 +171,78 @@ public EndpointStateMachine.EndPointStates call() throws Exception {
// put back the reports which failed to be sent
putBackIncrementalReports(requestBuilder);
rpcEndpoint.logIfNeeded(ex);
+ maybeRefreshScmAddress(ex);
} finally {
rpcEndpoint.unlock();
}
return rpcEndpoint.getState();
}
+ /**
+ * After a heartbeat IOException, if (a) DNS-refresh-on-failure is
+ * enabled, (b) the exception's cause chain contains a connection-class
+ * type (per {@link ConnectionFailureUtils#isConnectionFailure}), and
+ * (c) the missed-heartbeat counter has reached the configured
+ * threshold, ask {@link org.apache.hadoop.ozone.container.common.statemachine.SCMConnectionManager}
+ * to re-resolve this peer's hostname. If the resolved IP differs from
+ * the cached one, the connection manager swaps the endpoint atomically
+ * for a fresh one bound to the new address. The replacement starts in
+ * GETVERSION state, which is the correct behavior when the peer pod
+ * has been recreated -- the new SCM is effectively a fresh process and
+ * needs the version handshake plus DN re-registration.
+ * <p>
+ * Symmetric with the OM/SCM client-side failover providers, which
+ * also gate refresh on connection-class exceptions: an
+ * {@code AccessControlException} or {@code OMException} arriving
+ * via the heartbeat path indicates the cached IP is reachable but
+ * the peer is rejecting us at the application layer -- DNS won't help.
+ * <p>
+ * Off by default. Enable via
+ * {@code ozone.client.failover.resolve-needed=true}.
+ */
+ private void maybeRefreshScmAddress(IOException heartbeatFailure) {
+ if (!resolveOnFailureEnabled) {
+ return;
+ }
+ if (!ConnectionFailureUtils.isConnectionFailure(heartbeatFailure)) {
+ return;
+ }
+ if (rpcEndpoint.getMissedCount() < refreshThreshold) {
+ return;
+ }
+ String hostAndPort = rpcEndpoint.getHostAndPort();
+ if (hostAndPort == null) {
Review Comment:
Fixed in . Inlined and removed the unused local.
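
For anyone following the thread, the three gates described in the Javadoc above (feature flag, connection-class cause, missed-heartbeat threshold) can be sketched in isolation roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the class `RefreshGateSketch`, the method `shouldRefresh`, and the particular set of exception types treated as connection-class are hypothetical, and the real `ConnectionFailureUtils#isConnectionFailure` may classify exceptions differently.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical sketch; none of these names are Ozone APIs.
class RefreshGateSketch {

  // Walks the cause chain looking for a connection-class exception.
  // ASSUMPTION: this set of types is illustrative only.
  static boolean isConnectionFailure(Throwable t) {
    int depth = 0;
    // The depth bound guards against a cyclic cause chain.
    while (t != null && depth++ < 16) {
      if (t instanceof ConnectException
          || t instanceof NoRouteToHostException
          || t instanceof UnknownHostException
          || t instanceof SocketTimeoutException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  // Returns true only when all three gates pass, in the same order
  // as the early returns in the diff above.
  static boolean shouldRefresh(boolean enabled, IOException failure,
      long missedCount, long threshold) {
    if (!enabled) {
      return false;
    }
    if (!isConnectionFailure(failure)) {
      return false;
    }
    return missedCount >= threshold;
  }

  public static void main(String[] args) {
    IOException refused =
        new IOException("heartbeat failed", new ConnectException("refused"));
    IOException denied = new IOException("access denied");
    System.out.println(shouldRefresh(true, refused, 3, 3));
    System.out.println(shouldRefresh(true, denied, 3, 3));
  }
}
```

An application-layer rejection (the `denied` case) never reaches the threshold check, which matches the Javadoc's point that re-resolving DNS cannot help when the cached IP is reachable.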
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]