[PR] HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths [ozone]

via GitHub Tue, 09 Jun 2026 00:49:32 -0700


kerneltime opened a new pull request, #10470:
URL: https://github.com/apache/ozone/pull/10470


   ## What changes were proposed in this pull request?
   
   This PR addresses 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514): Datanode and 
OzoneManager fail to recover from SCM peer IP changes; cache stale 
`InetSocketAddress` for process lifetime.
   
   In Kubernetes (and any environment where peer pod IPs may change while DNS 
names remain stable), Ozone DataNodes and OzoneManagers can become permanently 
disconnected from SCM after an SCM peer pod is rescheduled to a new IP. The 
DataNode/OM process remains alive but every heartbeat or RPC call dials the 
now-defunct IP forever. Recovery today requires either a process restart or an 
external operator that watches SCM pod IPs and force-restarts dependent 
components.
   
   This is the same class of bug HADOOP-17068 fixed for HDFS NameNode HA. This 
PR mirrors that pattern at the `FailoverProxyProvider` / `EndpointStateMachine` 
layer in Ozone (one tier above where Hadoop applied the fix, because Ozone's 
RPC seams live there) for the four inter-component Hadoop-RPC paths, and 
removes the IP-baking from the two Ratis paths so gRPC's `DnsNameResolver` can 
re-resolve hostnames on its own.
   
   ### Why is this opt-in?
   
   The new behaviour is gated by a config flag, **default `false`**, so that 
existing non-K8s deployments see zero change. Operators in Kubernetes flip it 
on. This matches the precedent set by:
   - `dfs.client.failover.resolve-needed` (HADOOP-17068 / HDFS-14118)
   - `hbase.resolve.hostnames.on.failure` (HBase `ConnectionImplementation`)
   
   ```
   ozone.client.failover.resolve-needed = false   (default)
   ozone.datanode.scm.heartbeat.address.refresh.threshold = 3   (default; DN-specific)
   ```
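
As a rough illustration of how the two settings could interact on the DataNode side, the sketch below gates a refresh on both the opt-in flag and the missed-heartbeat threshold. `RefreshGate` and `shouldRefresh` are hypothetical names. That the DN threshold is only consulted when the flag is on is an assumption drawn from the description above, not a quote of the patch.

```java
/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class RefreshGate {
  // Assumed to mirror ozone.client.failover.resolve-needed (default false).
  private final boolean resolveNeeded;
  // Assumed to mirror ozone.datanode.scm.heartbeat.address.refresh.threshold (default 3).
  private final int threshold;

  RefreshGate(boolean resolveNeeded, int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.resolveNeeded = resolveNeeded;
    this.threshold = threshold;
  }

  /** True only when the feature is enabled and enough heartbeats were missed. */
  boolean shouldRefresh(int missedCount) {
    return resolveNeeded && missedCount >= threshold;
  }
}
```

With the defaults, `new RefreshGate(false, 3)` never asks for a refresh, so a non-Kubernetes deployment keeps its current behaviour.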
   
   ### Per-path summary
   
   | Path | Mechanism |
   |---|---|
   | **DN → SCM heartbeat** | `EndpointStateMachine` preserves the `hostAndPort` string. The `HeartbeatEndpointTask` catch block calls `maybeRefreshScmAddress` when `missedCount` ≥ threshold. `SCMConnectionManager.refreshSCMServer` swaps the endpoint atomically; the new endpoint starts in `GETVERSION` state, which is correct because a rescheduled peer is effectively a fresh process. |
   | **OM → SCM** | `SCMProxyInfo` retains the config-time host:port. `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)` runs in `shouldRetry` when the exception chain contains `ConnectException`, `NoRouteToHostException`, or `UnknownHostException`. The stale proxy is stopped via `RPC.stopProxy`. |
   | **Client → OM (Hadoop RPC)** | `OMProxyInfo.rpcAddr` becomes mutable behind the existing monitor. `refreshAddressIfChanged()` re-resolves `rpcAddrStr`, swaps `rpcAddr` and the derived `dtService`, and nulls the cached proxy so the next `createProxyIfNeeded` dials the new IP. `OMFailoverProxyProviderBase.shouldRetry` calls this on connection-class exceptions before advancing the failover index. |
   | **Client → OM (gRPC)** | No code change: `GrpcOMFailoverProxyProvider` passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` re-resolve hostnames on its own schedule. |
   | **OM ↔ OM control plane** | Uses Hadoop RPC via `OMInterServiceProtocol`, not Ratis. Recovers transitively via the Client → OM fix. |
   | **OM ↔ OM Ratis replication** | `OzoneManagerRatisServer.createRaftPeer` is simplified to always pass a hostname:port string to `RaftPeer.setAddress`, never a resolved IP. Two of the three previous `createRaftPeer` branches called `new InetSocketAddress(omNode.getInetAddress(), ratisPort)`, which strips the hostname and freezes the IP. With hostname-only addresses, gRPC's default `DnsNameResolver` (used by Ratis under the hood) re-resolves on connection failure or on its own refresh schedule. **No Ratis upstream change required.** |
   | **SCM ↔ SCM Ratis replication** | Already used hostname strings. This PR removes the misleading `// TODO : Should we use IP instead of hostname??` comment from `SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl` and replaces it with explanatory comments. |
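
The Hadoop-RPC rows above share one pattern: keep the configured hostname, re-resolve it after a failure, and swap the cached address only if the IP actually changed. A minimal sketch of that pattern follows. `RefreshableEndpoint`, its `Resolver` seam, and `refreshIfChanged` are hypothetical names for illustration; the real classes in the PR (`SCMConnectionManager`, `SCMProxyInfo`, `OMProxyInfo`) also rebuild or stop RPC proxies, which is omitted here.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of re-resolve-on-failure; not the PR's actual classes. */
final class RefreshableEndpoint {
  /** Resolver seam so DNS can be faked in tests (illustrative only). */
  interface Resolver {
    InetAddress resolve(String host) throws UnknownHostException;
  }

  private final String host;   // config-time hostname, never discarded
  private final int port;
  private final Resolver resolver;
  private final AtomicReference<InetSocketAddress> current;

  RefreshableEndpoint(String host, int port, InetSocketAddress initial,
      Resolver resolver) {
    this.host = host;
    this.port = port;
    this.resolver = resolver;
    this.current = new AtomicReference<>(initial);
  }

  InetSocketAddress address() {
    return current.get();
  }

  /** Re-resolves the hostname; returns true only if the address was swapped. */
  boolean refreshIfChanged() {
    InetSocketAddress old = current.get();
    InetAddress latest;
    try {
      latest = resolver.resolve(host);
    } catch (UnknownHostException e) {
      return false;  // keep the old address; DNS may recover later
    }
    if (latest.equals(old.getAddress())) {
      return false;  // same IP, nothing to do
    }
    // Swap atomically; a concurrent refresh that won the race makes this a no-op.
    return current.compareAndSet(old, new InetSocketAddress(latest, port));
  }
}
```

Against real DNS the resolver could simply be `InetAddress::getByName`; the seam exists so the no-op and swap cases can be exercised without a network.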
   
   ### Connection-class exception filter
   
   Re-resolution is gated on exception types where DNS could plausibly help:
   - `java.net.ConnectException` — connection refused / unreachable
   - `java.net.NoRouteToHostException` — host route gone
   - `java.net.UnknownHostException` — DNS lookup failed downstream
   
   This excludes application-level errors (`OMNotLeaderException`, 
`RetryAction`, `OMException`, `AccessControlException`) where SCM/OM is 
reachable on the cached IP and the failure is logical, not network.
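
The filter described above amounts to walking the exception's cause chain and looking for one of the three network-level types. A minimal sketch, assuming a hypothetical `ConnectionFailures.isConnectionClass` helper (the PR's own method name and placement are not shown here); the depth bound is an added safeguard against cyclic cause chains, not something the PR describes:

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

/** Hypothetical helper illustrating the connection-class exception filter. */
final class ConnectionFailures {
  private static final int MAX_DEPTH = 32;  // guard against cyclic cause chains

  private ConnectionFailures() {
  }

  /** True if the throwable or any of its causes is a network-level failure. */
  static boolean isConnectionClass(Throwable t) {
    int depth = 0;
    for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause()) {
      if (c instanceof ConnectException
          || c instanceof NoRouteToHostException
          || c instanceof UnknownHostException) {
        return true;
      }
      depth++;
    }
    return false;
  }
}
```

An `IOException` wrapping a `ConnectException` passes the filter, while a bare application-level `IOException` does not, so logical failures never trigger a DNS lookup.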
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-15514
   
   ## How was this patch tested?
   
   **13 new unit tests + 1 real-RPC integration test, all passing under `mvn 
clean test` on the latest master:**
   
   | Test class | Tests | Coverage |
   |---|---|---|
   | `TestSCMConnectionManager` | +5 new | `resolveLatestAddress` edge cases, `refreshSCMServer` happy-path swap, no-op when IP unchanged, no-op when host:port not preserved (legacy ctor path) |
   | `TestSCMFailoverProxyProviderRefresh` | 3 new | Swap on IP change, no-op when unchanged, no-op when `hostAndPort` not preserved |
   | `TestOMProxyInfoDnsRefresh` | 3 new | Address swap, dtService update, proxy null-out, proxy rebuild after refresh |
   | `TestSCMConnectionManagerDnsRefreshE2E` | 1 new | Real Hadoop RPC server (via `ScmTestMock`) on a real loopback socket. Connection manager primed with deliberately stale `127.0.0.99` and preserved hostname `localhost:port`. `refreshSCMServer` fires; a real `sendHeartbeat` round-trips to the live server; `ScmTestMock.rpcCount` increments. Proves the full chain: address swap → fresh RPC proxy → real socket dial → server-side handler invocation. |
   | `TestOzoneManagerRatisServer` | +1 new | Asserts `RaftPeer.getAddress()` is a hostname:port string, never an IP:port string. Defensive regex check that the host portion is not a numeric IPv4. |
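
The "host portion is not a numeric IPv4" assertion in the last row can be approximated as below. This is only a sketch of the idea: `PeerAddressCheck`, the regex, and the sample addresses are assumptions, not the test's actual code, and the check deliberately ignores IPv6 literals.

```java
import java.util.regex.Pattern;

/** Hypothetical check that a peer address carries a hostname, not an IPv4 literal. */
final class PeerAddressCheck {
  // Four dot-separated groups of 1-3 digits; a shape check, not a range check.
  private static final Pattern IPV4_LITERAL =
      Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

  private PeerAddressCheck() {
  }

  /** Returns true when the host part of "host:port" is not a dotted-quad IPv4. */
  static boolean hostIsNotIpv4Literal(String hostAndPort) {
    int colon = hostAndPort.lastIndexOf(':');
    if (colon <= 0) {
      throw new IllegalArgumentException("expected host:port, got " + hostAndPort);
    }
    String host = hostAndPort.substring(0, colon);
    return !IPV4_LITERAL.matcher(host).matches();
  }
}
```

A test in this style would pass for an address such as `om1.example.com:9872` and fail for `192.168.97.3:9872`.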
   
   **Existing regression tests (no failures):** `TestSCMConnectionManager` (1 
prior) + `TestEndPoint` (17) + `TestOMFailoverProxyProvider` (8) + 
`TestOMFailovers` (1) + `TestOzoneManagerRatisServer` (5 prior) — all green.
   
   **docker-compose validation** with the `ozone-ha` stack confirmed:
   - Build pipeline correctly bundles the new code into 
`hdds-container-service-2.2.0-SNAPSHOT.jar` (verified via `javap`).
   - Config flag reaches the DN's process environment via the 
`OZONE-SITE.XML_*` mechanism.
   - The cluster boots, DataNode registers with SCM HA quorum, and writes 
succeed under the opt-in flag.
   - After SCM IP rotation (via `docker network disconnect/connect` with a 
squatter on the old IP), the post-HADOOP-17068 `Client.updateAddress()` 
recovery in Hadoop common also fires (visible as `WARN ipc_.Client: Address 
change detected. Old: scm1/192.168.97.3:9861 New: scm1/192.168.97.12:9861`). 
The refresh added in this PR is the recovery path that matters in the **AWS 
EC2/EKS silent-drop** scenario, where `updateAddress()` does not fire because 
the connect never returns within the IPC retry budget.
   
   ## Scope and known limitations
   
   - The DN refresh fires from the `HEARTBEAT` phase via 
`HeartbeatEndpointTask`. If a DataNode starts up with the SCM peer already at a 
stale IP and never reaches `HEARTBEAT`, the recovery path does not engage. 
Initial-bringup DNS staleness is the existing concern of HDDS-5919's 
`ozone.network.jvm.address.cache.enabled=false`. 
`InitDatanodeState.java:94-101` already postpones initialization on 
initial-resolution failure.
   - HDFS-14118-style construction-time DNS fan-out (one hostname → multiple 
persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out of 
scope here. Worth a follow-on JIRA if Ozone deployments need it.
   - The Ratis quorum-loss exit-0 issue (`SCMStateMachine.close()` calling 
`ExitUtils.terminate(0, ...)` when leader election fails to converge, leading 
to Kubernetes CrashLoopBackOff death spirals) is a separate concern worth its 
own JIRA.
   
   ## References
   
   - HADOOP-17068: client fails forever when namenode ipaddr changed (Hadoop 
3.4.0). Commit `fa14e4bc001e28d9912e8d985d09bab75aedb87c`.
   - HDFS-14118: introduces `dfs.client.failover.resolve-needed` and the 
`AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary` hook.
   - HBase: `hbase.resolve.hostnames.on.failure` 
(`ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY`).
   - ZOOKEEPER-1506, ZOOKEEPER-2982: ZK `StaticHostProvider` re-resolves on 
each `next()` call.
   - HDDS-5919: introduces `ozone.network.jvm.address.cache.enabled` (default 
`true`). JVM-level DNS cache TTL — necessary but not sufficient for the 
long-lived `InetSocketAddress` problem this PR fixes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

