Copilot commented on code in PR #10470:
URL: https://github.com/apache/ozone/pull/10470#discussion_r3378839393


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java:
##########
@@ -165,14 +180,61 @@ public void addSCMServer(InetSocketAddress address,
               rpcProxy);
 
       EndpointStateMachine endPoint = new EndpointStateMachine(address,
-          rpcClient, this.conf, threadNamePrefix);
+          hostAndPort, rpcClient, this.conf, threadNamePrefix);
       endPoint.setPassive(false);
       scmMachines.put(address, endPoint);
     } finally {
       writeUnlock();
     }
   }
 
+  /**
+   * Re-resolve the SCM hostname for the endpoint at {@code oldAddress} and,
+   * if the resolved IP has changed, atomically replace the endpoint with a
+   * fresh one bound to the new address.
+   * <p>
+   * Returns the new {@link InetSocketAddress} on a successful swap, or null
+   * if no swap occurred (endpoint not found, hostname not preserved at
+   * construction, IP unchanged, or DNS lookup failed). Callers receive
+   * enough information to update any external maps keyed by the old
+   * address.
+   * <p>
+   * The replacement is built via the existing {@link #addSCMServer} path,
+   * so the new endpoint starts in {@code GETVERSION} state and re-walks
+   * version → register → heartbeat. This is the correct behavior: a peer
+   * that has been rescheduled is effectively a fresh process.
+   */
+  public InetSocketAddress refreshSCMServer(InetSocketAddress oldAddress,
+      String threadNamePrefix) throws IOException {
+    writeLock();
+    try {
+      EndpointStateMachine existing = scmMachines.get(oldAddress);
+      if (existing == null) {
+        return null;
+      }
+      InetSocketAddress refreshed = existing.resolveLatestAddress();
+      if (refreshed == null) {
+        return null;
+      }
+      // Tear down before adding to avoid two endpoints claiming the same
+      // logical SCM peer.
+      String hostAndPort = existing.getHostAndPort();
+      scmMachines.remove(oldAddress);
+      try {
+        existing.close();
+      } catch (RuntimeException closeEx) {
+        LOG.warn("Failed to close stale endpoint {}: {}", oldAddress,
+            closeEx.getMessage());
+      }
+      addSCMServer(refreshed, hostAndPort, threadNamePrefix);
+      LOG.info("DNS re-resolution: SCM endpoint {} -> {} (host {}).",
+          oldAddress, refreshed, hostAndPort);
+      return refreshed;

Review Comment:
   `refreshSCMServer` removes and closes the existing endpoint before 
attempting to create/add the replacement. If `addSCMServer(refreshed, ...)` 
throws (eg transient failure while the new SCM is still starting), the 
connection manager will end up with no endpoint for that peer, which can break 
further heartbeats and recovery attempts. Add the refreshed endpoint first (or 
roll back on failure) and only then remove/close the stale endpoint.



##########
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java:
##########
@@ -43,9 +43,26 @@ public final class OMProxyInfo<T> extends ProxyInfo<T> {
   private static final Logger LOG = LoggerFactory.getLogger(OMProxyInfo.class);
 
   private final String nodeId;
+  /**
+   * The original "host:port" config string. Stable for the lifetime of
+   * this OMProxyInfo; used as the source of truth for re-resolving DNS
+   * when the cached IP becomes stale (Kubernetes pod-IP-change recovery).
+   */
   private final String rpcAddrStr;
-  private final InetSocketAddress rpcAddr;
-  private final Text dtService;
+  /**
+   * The currently-resolved address. Initialized at construction by
+   * resolving {@link #rpcAddrStr}, and may be replaced atomically by
+   * {@link #refreshAddressIfChanged()} when the failover provider
+   * detects that the SCM peer has been rescheduled to a new IP.

Review Comment:
   This Javadoc refers to the "SCM peer" being rescheduled, but `OMProxyInfo` 
tracks OM addresses; the comment should say OM (or "OM node") to avoid 
confusion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to