kerneltime opened a new pull request, #10487: URL: https://github.com/apache/ozone/pull/10487
## What changes were proposed in this pull request? This is **PR 3 of 4** splitting [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback). This PR fixes the **OM → SCM Hadoop-RPC path** by applying the same DNS-refresh-on-failure pattern that PR #10486 introduced for Client → OM. The shape is symmetric: `SCMProxyInfo` gains a preserved hostname, `SCMFailoverProxyProviderBase.shouldRetry` invokes a new `refreshProxyAddressIfChanged(nodeId)` on connection-class exceptions, and the same atomic-replace + leader-pin discipline applies. > **Stacked on [#10486](https://github.com/apache/ozone/pull/10486)** (PR-2 of 4 — Client → OM DNS refresh + shared infrastructure). Please review PR-2 first; the `ConnectionFailureUtils` classifier and the `ozone.client.failover.resolve-needed` flag both come from PR-2 and are reused here. ## Why this matters `SCMProxyInfo` constructs an `InetSocketAddress` from the configured `host:port` once at process start, then re-uses it across every OM-side outbound RPC to that SCM. When an SCM pod is rescheduled in Kubernetes, every subsequent OM → SCM call dials the gone-away IP forever. The failover ring walks past the broken peer to the next SCM, but the broken peer never recovers without an OM process restart. Before this PR, OM → SCM was the **second** of three Ozone seams that exhibit this bug (after Client → OM, fixed in PR-2). The remaining seam — DN → SCM heartbeat — is fixed in PR-4. ## What this PR does ### `SCMProxyInfo` becomes refreshable | Change | Why | |---|---| | New `final String hostAndPort` field, populated from the config-time `host:port` string. | The string is the source of truth for re-resolution; the `InetSocketAddress` is a derived cache. | | New `getHostAndPort()` accessor. | Required by the provider's refresh path. | | `rpcAddr` becomes mutable behind a monitor (was effectively final). Atomic swap of `rpcAddr` happens under the entry monitor; DNS lookup happens **outside** any lock. | A slow / dead resolver under lock would freeze every concurrent caller of `getProxy()` / `getAddress()`. | ### `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)` Three-phase refresh: 1. **Outside any lock**: re-resolve `hostAndPort` via `NetUtils.createSocketAddr`. Compare resolved IP with cached. If unchanged or unresolved, return false. 2. **Under the provider monitor**: capture the stale proxy reference, swap `rpcAddr` and clear the cached proxy. The next `getProxy(nodeId)` call rebuilds via the existing `createSCMProxy(nodeId)` path. 3. **Outside the monitor**: stop the stale proxy via `RPC.stopProxy(staleProxy)`. ### `shouldRetry` wiring After the existing `performFailoverToAssignedLeader(null, e)` decision and before the failover-index advance, when: - `ozone.client.failover.resolve-needed=true`, AND - `ConnectionFailureUtils.isConnectionFailure(exception)` matches (the same classifier landed in PR-2), the provider calls `refreshProxyAddressIfChanged(getCurrentProxySCMNodeId())`. On a successful refresh, the provider: - Returns `FAILOVER_AND_RETRY`, AND - Sets `updatedLeaderNodeID` to the just-refreshed nodeId so `RetryInvocationHandler.performFailover` does NOT advance past the fixed peer. Without that pin, an N-peer SCM HA cluster would skip the now-fixed SCM for up to N-1 attempts. ## Atomic-replace and lock discipline The provider follows the same pattern PR-2 established for OMProxyInfo: - DNS lookup outside the monitor — never block other callers behind `getaddrinfo`. - New `dtService` (or in this case, the new resolved address) built outside the monitor — a throw aborts cleanly with no half-swap. - Stale proxy stopped outside the monitor — `RPC.stopProxy` blocks on socket teardown; holding the monitor across that would stall every concurrent `getProxy()`. - Re-check inside the monitor that the snapshot we resolved is still current (defends against a concurrent `loadConfigs`-driven swap). ## How was this patch tested? | Test class | Count | Coverage | |---|---|---| | `TestSCMFailoverProxyProviderRefresh` (new) | 3 | Per-instance refresh: swap on IP change, no-op when unchanged, no-op when `hostAndPort` not preserved (legacy path). | | `TestSCMFailoverProxyProviderRefreshWired` (new) | 5 | End-to-end retry path. Connection-class exceptions (`ConnectException`, `SocketTimeoutException`) trigger refresh; application-level exceptions and `RetryAction`-coded responses do NOT; flag-off does NOT; after a successful refresh, `updatedLeaderNodeID` stays pinned. | Existing `TestSCMFailoverProxyProvider` regression suite verified non-regressed. ## Scope of this PR - **Includes:** OM → SCM Hadoop RPC. - **Out of scope (PR-4):** DN → SCM heartbeat. The DN heartbeat path uses `EndpointStateMachine` / `SCMConnectionManager` — a different abstraction than `SCMFailoverProxyProviderBase` because the DN does not failover (a DN heartbeats every SCM, not just the leader). PR-4 brings its own atomic-replace machinery and a separate threshold knob (`ozone.datanode.scm.heartbeat.address.refresh.threshold`). ## What is the link to the Apache JIRA? https://issues.apache.org/jira/browse/HDDS-15514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
