kerneltime opened a new pull request, #10470: URL: https://github.com/apache/ozone/pull/10470
## What changes were proposed in this pull request?

This PR addresses [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514): Datanode and OzoneManager fail to recover from SCM peer IP changes; cache stale `InetSocketAddress` for process lifetime.

In Kubernetes (and any environment where peer pod IPs may change while DNS names remain stable), Ozone DataNodes and OzoneManagers can become permanently disconnected from SCM after an SCM peer pod is rescheduled to a new IP. The DataNode/OM process remains alive, but every heartbeat or RPC call dials the now-defunct IP forever. Recovery today requires either a process restart or an external operator that watches SCM pod IPs and force-restarts dependent components.

This is the same class of bug HADOOP-17068 fixed for HDFS NameNode HA. This PR mirrors that pattern at the `FailoverProxyProvider` / `EndpointStateMachine` layer in Ozone (one tier above where Hadoop applied the fix, because Ozone's RPC seams live there) for the four inter-component Hadoop-RPC paths, and removes the IP-baking from the two Ratis paths so gRPC's `DnsNameResolver` can re-resolve hostnames on its own.

### Why is this opt-in?

The new behaviour is gated by a config flag, **default `false`**, so that existing non-K8s deployments see zero change. Operators in Kubernetes flip it on. This matches the precedent set by:

- `dfs.client.failover.resolve-needed` (HADOOP-17068 / HDFS-14118)
- `hbase.resolve.hostnames.on.failure` (HBase `ConnectionImplementation`)

```
ozone.client.failover.resolve-needed = false   (default)
ozone.datanode.scm.heartbeat.address.refresh.threshold = 3   (default; DN-specific)
```

### Per-path summary

| Path | Mechanism |
|---|---|
| **DN → SCM heartbeat** | `EndpointStateMachine` preserves `hostAndPort` string. `HeartbeatEndpointTask` catch block calls `maybeRefreshScmAddress` when `missedCount` ≥ threshold. `SCMConnectionManager.refreshSCMServer` swaps the endpoint atomically; the new endpoint starts in `GETVERSION` state, which is correct because a rescheduled peer is effectively a fresh process. |
| **OM → SCM** | `SCMProxyInfo` retains the config-time host:port. `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)` runs in `shouldRetry` when the exception chain contains `ConnectException`, `NoRouteToHostException`, or `UnknownHostException`. Stale proxy is stopped via `RPC.stopProxy`. |
| **Client → OM (Hadoop RPC)** | `OMProxyInfo.rpcAddr` becomes mutable behind the existing monitor. `refreshAddressIfChanged()` re-resolves `rpcAddrStr`, swaps `rpcAddr` and the derived `dtService`, and nulls the cached proxy so the next `createProxyIfNeeded` dials the new IP. `OMFailoverProxyProviderBase.shouldRetry` calls this on connection-class exceptions before advancing the failover index. |
| **Client → OM (gRPC)** | No code change: `GrpcOMFailoverProxyProvider` passes a placeholder `InetSocketAddress(0)` and lets gRPC's `NameResolver` re-resolve hostnames on its own schedule. |
| **OM ↔ OM control plane** | Uses Hadoop RPC via `OMInterServiceProtocol`, not Ratis. Recovers transitively via the Client → OM fix. |
| **OM ↔ OM Ratis replication** | `OzoneManagerRatisServer.createRaftPeer` simplified to always pass a hostname:port string to `RaftPeer.setAddress`, never a resolved IP. Two of three previous `createRaftPeer` branches were calling `new InetSocketAddress(omNode.getInetAddress(), ratisPort)`, which strips the hostname and freezes the IP. With hostname-only addresses, gRPC's default `DnsNameResolver` (used by Ratis under the hood) re-resolves on connection failure / on its own refresh schedule. **No Ratis upstream change required.** |
| **SCM ↔ SCM Ratis replication** | Already used hostname strings; removes a misleading `// TODO : Should we use IP instead of hostname??` comment in `SCMRatisServerImpl.buildRaftGroup` and `SCMHAManagerImpl` and replaces it with explanatory comments. |

### Connection-class exception filter

Re-resolution is gated on exception types where DNS could plausibly help:

- `java.net.ConnectException`: connection refused / unreachable
- `java.net.NoRouteToHostException`: host route gone
- `java.net.UnknownHostException`: DNS lookup failed downstream

This excludes application-level errors (`OMNotLeaderException`, `RetryAction`, `OMException`, `AccessControlException`) where SCM/OM is reachable on the cached IP and the failure is logical, not network.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15514

## How was this patch tested?

**13 new unit tests + 1 real-RPC integration test, all passing under `mvn clean test` on the latest master:**

| Test class | Tests | Coverage |
|---|---|---|
| `TestSCMConnectionManager` | +5 new | `resolveLatestAddress` edge cases, `refreshSCMServer` happy-path swap, no-op when IP unchanged, no-op when host:port not preserved (legacy ctor path) |
| `TestSCMFailoverProxyProviderRefresh` | 3 new | Swap on IP change, no-op when unchanged, no-op when `hostAndPort` not preserved |
| `TestOMProxyInfoDnsRefresh` | 3 new | Address swap, dtService update, proxy null-out, proxy rebuild after refresh |
| `TestSCMConnectionManagerDnsRefreshE2E` | 1 new | Real Hadoop RPC server (via `ScmTestMock`) on a real loopback socket. Connection manager primed with deliberately stale `127.0.0.99` and preserved hostname `localhost:port`. `refreshSCMServer` fires; a real `sendHeartbeat` round-trips to the live server; `ScmTestMock.rpcCount` increments. Proves the full chain: address swap → fresh RPC proxy → real socket dial → server-side handler invocation. |
| `TestOzoneManagerRatisServer` | +1 new | Asserts `RaftPeer.getAddress()` is a hostname:port string, never an IP:port string. Defensive regex check that the host portion is not a numeric IPv4. |

**Existing regression tests (no failures):** `TestSCMConnectionManager` (1 prior) + `TestEndPoint` (17) + `TestOMFailoverProxyProvider` (8) + `TestOMFailovers` (1) + `TestOzoneManagerRatisServer` (5 prior), all green.

**docker-compose validation** with the `ozone-ha` stack confirmed:

- Build pipeline correctly bundles the new code into `hdds-container-service-2.2.0-SNAPSHOT.jar` (verified via `javap`).
- Config flag reaches the DN's process environment via the `OZONE-SITE.XML_*` mechanism.
- The cluster boots, DataNode registers with SCM HA quorum, and writes succeed under the opt-in flag.
- After SCM IP rotation (via `docker network disconnect/connect` with a squatter on the old IP), the post-HADOOP-17068 `Client.updateAddress()` recovery in Hadoop common also fires (visible as `WARN ipc_.Client: Address change detected. Old: scm1/192.168.97.3:9861 New: scm1/192.168.97.12:9861`). My fix is the load-bearing recovery for the **AWS EC2/EKS silent-drop** scenario where `updateAddress()` does not fire because the connect never returns within the IPC retry budget.

## Scope and known limitations

- The DN refresh fires from the `HEARTBEAT` phase via `HeartbeatEndpointTask`. If a DataNode starts up with the SCM peer already at a stale IP and never reaches `HEARTBEAT`, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of HDDS-5919's `ozone.network.jvm.address.cache.enabled=false`. `InitDatanodeState.java:94-101` already postpones initialization on initial-resolution failure.
- HDFS-14118-style construction-time DNS fan-out (one hostname → multiple persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out of scope here. Worth a follow-on JIRA if Ozone deployments need it.
- The Ratis quorum-loss exit-0 issue (`SCMStateMachine.close()` calling `ExitUtils.terminate(0, ...)` when leader election fails to converge, leading to Kubernetes CrashLoopBackOff death spirals) is a separate concern worth its own JIRA.

## References

- HADOOP-17068: client fails forever when namenode ipaddr changed (Hadoop 3.4.0). Commit `fa14e4bc001e28d9912e8d985d09bab75aedb87c`.
- HDFS-14118: introduces `dfs.client.failover.resolve-needed` and the `AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary` hook.
- HBase: `hbase.resolve.hostnames.on.failure` (`ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY`).
- ZOOKEEPER-1506, ZOOKEEPER-2982: ZK `StaticHostProvider` re-resolves on each `next()` call.
- HDDS-5919: introduces `ozone.network.jvm.address.cache.enabled` (default `true`). JVM-level DNS cache TTL; necessary but not sufficient for the long-lived `InetSocketAddress` problem this PR fixes.
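
## Illustrative sketch of the refresh pattern

For readers who want the shape of the mechanism without opening the diff, the connection-class exception filter and the "re-resolve the preserved host:port and swap only if the IP changed" step described in the per-path summary might look roughly like the following. This is a minimal sketch, not the PR's code: the class `DnsRefreshSketch` and the methods `isConnectionClass` and `refreshIfChanged` are hypothetical names chosen for illustration, and the real implementation also has to stop the stale proxy and honour the opt-in flag.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

/** Hypothetical illustration only; names do not come from the Ozone code base. */
public final class DnsRefreshSketch {

  private DnsRefreshSketch() { }

  /**
   * Returns true if any throwable in the cause chain is one of the three
   * connection-class exceptions for which a DNS re-resolution could help.
   * The depth cap guards against pathological self-referencing chains.
   */
  public static boolean isConnectionClass(Throwable t) {
    int depth = 0;
    for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
      if (c instanceof ConnectException
          || c instanceof NoRouteToHostException
          || c instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  /**
   * Re-resolves the preserved host and port. Returns the cached address
   * unchanged when resolution fails or yields the same IP and port, and
   * a freshly resolved address otherwise.
   */
  public static InetSocketAddress refreshIfChanged(
      String host, int port, InetSocketAddress cached) {
    // This constructor attempts a lookup each time it is called.
    InetSocketAddress fresh = new InetSocketAddress(host, port);
    if (fresh.isUnresolved()) {
      return cached; // DNS did not help; keep what we have.
    }
    if (cached != null
        && fresh.getAddress().equals(cached.getAddress())
        && fresh.getPort() == cached.getPort()) {
      return cached; // Same IP; no swap needed.
    }
    return fresh;
  }

  public static void main(String[] args) {
    // A wrapped ConnectException is a connection-class failure.
    Throwable wrapped = new IOException("rpc failed",
        new ConnectException("connection refused"));
    System.out.println("refresh on wrapped ConnectException: "
        + isConnectionClass(wrapped));

    // A logical failure should not trigger re-resolution.
    System.out.println("refresh on IllegalStateException: "
        + isConnectionClass(new IllegalStateException("not leader")));

    // Simulate a stale cached IP for a hostname that now resolves elsewhere.
    InetSocketAddress stale = new InetSocketAddress("127.0.0.99", 9861);
    InetSocketAddress current = refreshIfChanged("localhost", 9861, stale);
    System.out.println("address swapped: " + (current != stale));
  }
}
```

Returning the cached instance on the no-change path lets a caller detect a swap with a simple reference comparison and rebuild the proxy only when needed. Note that the JVM-level DNS cache still applies to the lookup, which is why HDDS-5919's setting remains relevant.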
