[PR] HDDS-15514. DNS refresh on connection failure for OM to SCM [ozone]

via GitHub Wed, 10 Jun 2026 22:16:07 -0700


kerneltime opened a new pull request, #10487:
URL: https://github.com/apache/ozone/pull/10487


   ## What changes were proposed in this pull request?
   
   This is **PR 3 of 4** splitting 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally 
proposed as a single ~160KB patch in #10473, split per @szetszwo's review 
feedback).
   
   This PR fixes the **OM → SCM Hadoop-RPC path** by applying the same 
DNS-refresh-on-failure pattern that PR #10486 introduced for Client → OM. The 
shape is symmetric: `SCMProxyInfo` gains a preserved hostname, 
`SCMFailoverProxyProviderBase.shouldRetry` invokes a new 
`refreshProxyAddressIfChanged(nodeId)` on connection-class exceptions, and the 
same atomic-replace + leader-pin discipline applies.
   
   > **Stacked on [#10486](https://github.com/apache/ozone/pull/10486)** (PR-2 
of 4 — Client → OM DNS refresh + shared infrastructure). Please review PR-2 
first; the `ConnectionFailureUtils` classifier and the 
`ozone.client.failover.resolve-needed` flag both come from PR-2 and are reused 
here.
   
   ## Why this matters
   
   `SCMProxyInfo` constructs an `InetSocketAddress` from the configured 
`host:port` once at process start, then re-uses it across every OM-side 
outbound RPC to that SCM. When an SCM pod is rescheduled in Kubernetes, every 
subsequent OM → SCM call dials the gone-away IP forever. The failover ring 
walks past the broken peer to the next SCM, but the broken peer never recovers 
without an OM process restart.
   
   Before this PR, OM → SCM was the **second** of three Ozone seams that 
exhibit this bug (after Client → OM, fixed in PR-2). The remaining seam — DN → 
SCM heartbeat — is fixed in PR-4.
   
   ## What this PR does
   
   ### `SCMProxyInfo` becomes refreshable
   
   | Change | Why |
   |---|---|
   | New `final String hostAndPort` field, populated from the config-time 
`host:port` string. | The string is the source of truth for re-resolution; the 
`InetSocketAddress` is a derived cache. |
   | New `getHostAndPort()` accessor. | Required by the provider's refresh 
path. |
   | `rpcAddr` becomes mutable behind a monitor (was effectively final). Atomic 
swap of `rpcAddr` happens under the entry monitor; DNS lookup happens 
**outside** any lock. | A slow / dead resolver under lock would freeze every 
concurrent caller of `getProxy()` / `getAddress()`. |
   
   ### `SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId)`
   
   Three-phase refresh:
   
   1. **Outside any lock**: re-resolve `hostAndPort` via 
`NetUtils.createSocketAddr`. Compare resolved IP with cached. If unchanged or 
unresolved, return false.
   2. **Under the provider monitor**: capture the stale proxy reference, swap 
`rpcAddr` and clear the cached proxy. The next `getProxy(nodeId)` call rebuilds 
via the existing `createSCMProxy(nodeId)` path.
   3. **Outside the monitor**: stop the stale proxy via 
`RPC.stopProxy(staleProxy)`.
   
   ### `shouldRetry` wiring
   
   After the existing `performFailoverToAssignedLeader(null, e)` decision and 
before the failover-index advance, when:
   
   - `ozone.client.failover.resolve-needed=true`, AND
   - `ConnectionFailureUtils.isConnectionFailure(exception)` matches (the same 
classifier landed in PR-2),
   
   the provider calls 
`refreshProxyAddressIfChanged(getCurrentProxySCMNodeId())`. On a successful 
refresh, the provider:
   
   - Returns `FAILOVER_AND_RETRY`, AND
   - Sets `updatedLeaderNodeID` to the just-refreshed nodeId so 
`RetryInvocationHandler.performFailover` does NOT advance past the fixed peer.
   
   Without that pin, an N-peer SCM HA cluster would skip the now-fixed SCM for 
up to N-1 attempts.
   
   ## Atomic-replace and lock discipline
   
   The provider follows the same pattern PR-2 established for OMProxyInfo:
   
   - DNS lookup outside the monitor — never block other callers behind 
`getaddrinfo`.
   - New `dtService` (or in this case, the new resolved address) built outside 
the monitor — a throw aborts cleanly with no half-swap.
   - Stale proxy stopped outside the monitor — `RPC.stopProxy` blocks on socket 
teardown; holding the monitor across that would stall every concurrent 
`getProxy()`.
   - Re-check inside the monitor that the snapshot we resolved is still current 
(defends against a concurrent `loadConfigs`-driven swap).
   
   ## How was this patch tested?
   
   | Test class | Count | Coverage |
   |---|---|---|
   | `TestSCMFailoverProxyProviderRefresh` (new) | 3 | Per-instance refresh: 
swap on IP change, no-op when unchanged, no-op when `hostAndPort` not preserved 
(legacy path). |
   | `TestSCMFailoverProxyProviderRefreshWired` (new) | 5 | End-to-end retry 
path. Connection-class exceptions (`ConnectException`, 
`SocketTimeoutException`) trigger refresh; application-level exceptions and 
`RetryAction`-coded responses do NOT; flag-off does NOT; after a successful 
refresh, `updatedLeaderNodeID` stays pinned. |
   
   Existing `TestSCMFailoverProxyProvider` regression suite verified 
non-regressed.
   
   ## Scope of this PR
   
   - **Includes:** OM → SCM Hadoop RPC.
   - **Out of scope (PR-4):** DN → SCM heartbeat. The DN heartbeat path uses 
`EndpointStateMachine` / `SCMConnectionManager` — a different abstraction than 
`SCMFailoverProxyProviderBase` because the DN does not failover (a DN 
heartbeats every SCM, not just the leader). PR-4 brings its own atomic-replace 
machinery and a separate threshold knob 
(`ozone.datanode.scm.heartbeat.address.refresh.threshold`).
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-15514
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] HDDS-15514. DNS refresh on connection failure for OM to SCM [ozone]

Reply via email to