[PR] HDDS-15514. DNS refresh on heartbeat failure for DN to SCM [ozone]

via GitHub Wed, 10 Jun 2026 22:22:45 -0700


kerneltime opened a new pull request, #10488:
URL: https://github.com/apache/ozone/pull/10488


   ## What changes were proposed in this pull request?
   
   This is **PR 4 of 4** splitting 
[HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally 
proposed as a single ~160KB patch in #10473, split per @szetszwo's review 
feedback).
   
   This PR fixes the **DN → SCM heartbeat path** — the largest and most 
invasive of the four split PRs. Unlike the failover-proxy-provider seams, the 
DN does not failover; it heartbeats every SCM in parallel via the 
`EndpointStateMachine` / `SCMConnectionManager` abstraction. The fix introduces:
   
   1. An atomic `EndpointStateMachine` swap when DNS re-resolution detects an 
IP change.
   2. Per-endpoint queue migration in `StateContext` so in-flight reports 
survive the swap.
   3. A separate threshold knob 
(`ozone.datanode.scm.heartbeat.address.refresh.threshold`, default 3) — the 
heartbeat path runs at a much higher cadence than the failover-proxy path, so a 
count-based gate prevents over-reaction to transient blips.
   
   > **Stacked on [#10487](https://github.com/apache/ozone/pull/10487)** (PR-3 
of 4 — OM → SCM DNS refresh). Reuses the `ConnectionFailureUtils` classifier 
and the `ozone.client.failover.resolve-needed` flag landed in [PR-2 
(#10486)](https://github.com/apache/ozone/pull/10486).
   
   ## Why this matters
   
   `EndpointStateMachine.address` is the cached `InetSocketAddress` that the DN 
heartbeat task uses to dial each SCM peer. It is constructed at DN startup from 
the configured `host:port` and never re-resolved. When an SCM pod is 
rescheduled in Kubernetes, every heartbeat to that peer dials the now-defunct 
IP forever. The DN's `endpoints` set still contains the broken peer's 
`EndpointStateMachine`, but that machine never recovers without a DN process 
restart.
   
   This is the path the AWC ozone-operator's existing 7-layer workaround was 
built to defeat: after watching SCM pod IP changes, the operator force-restarts 
every DN. PR-4 is the upstream fix that lets the operator drop those restarts.
   
   ## What this PR does
   
   ### 1. `EndpointStateMachine` preserves a hostname
   
   | Change | Why |
   |---|---|
   | New `final String hostAndPort` field. | Source of truth for re-resolution. 
|
   | `resolveLatestAddress()`: re-resolves `hostAndPort` via 
`NetUtils.createSocketAddr` and returns the freshly-resolved 
`InetSocketAddress` only if its `getAddress()` differs from the cached one. 
Returns null on legacy endpoints (no preserved `hostAndPort`), unresolved DNS, 
or unchanged IP. | Lets the heartbeat task ask "did the IP just change?" 
without committing to a swap. |
   
   ### 2. `SCMConnectionManager.refreshSCMServer` — 4-phase atomic swap
   
   ```
   PHASE A (read lock):       snapshot the endpoint reference and hostAndPort
   PHASE B (no lock):         resolveLatestAddress  (DNS lookup must NEVER hold 
a lock)
   PHASE C (write lock):      re-check snapshot, enforce collision invariant,
                              build replacement endpoint, commit swap
   PHASE D (no lock):         close stale endpoint  (RPC.stopProxy + socket 
teardown)
   ```
   
   Crucial properties (each had a corresponding bug in the original combined PR 
that Copilot's failure-injection lens caught):
   
   - **Build-then-swap, never remove-then-build.** If `buildScmEndpoint` throws 
(transient DNS, peer not yet accepting on the new IP, NetUtils refusing the 
address), the stale endpoint stays registered. Otherwise the peer would 
disappear from `scmMachines` entirely and no heartbeat could recover it. Tested 
by 
`TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure`
 using a `@VisibleForTesting` overridable `buildScmEndpoint` hook.
   - **Refuse swaps that collide with another registered peer key.** If 
transient kube-DNS returns peer-B's IP for peer-A's hostname, the swap is 
refused rather than overwriting peer-B's endpoint. Without this, peer-B's 
`EndpointStateMachine` would be silently replaced, leaking its executor and 
orphaning its task thread.
   - **Re-check after DNS lookup.** A concurrent `removeSCMServer` or refresh 
may have raced ahead while we were resolving. The write-lock phase verifies the 
snapshot is still current before swapping.
   - **`close()` outside the lock.** Stale-endpoint teardown blocks on 
`RPC.stopProxy`; holding `writeLock()` across that would stall every concurrent 
heartbeat / reconfiguration.
   
   ### 3. `StateContext.migrateEndpoint` — preserve in-flight reports across 
swap
   
   Per-endpoint queues (`incrementalReportsQueue`, `containerActions`, 
`pipelineActions`, `isFullReportReadyToBeSent`) are keyed by 
`InetSocketAddress`. Without migration, a swap would orphan all queued reports 
for that peer. The migration is ordered to preserve the invariant **"every 
endpoint in `endpoints` has a queue at every observable point"**:
   
   1. **PUBLISH** — install new-key queues alongside the old-key queues.
   2. **SWITCH** — add `newEndpoint` to the endpoints set; remove `oldEndpoint` 
from the endpoints set.
   3. **RETIRE** — drop the old-key queues (no producer can reach them after 
step 2).
   
   `endpoints` is now a `CopyOnWriteArraySet` (was `HashSet`). 
`incrementalReportsQueue`, `containerActions`, `pipelineActions`, and 
`isFullReportReadyToBeSent` are now `ConcurrentHashMap` (some already were). 
Producers null-skip queue lookups as defense-in-depth — a producer racing 
migration MUST NOT NPE on a concurrent `remove`.
   
   The full-report flags get a special case: a swapped endpoint is effectively 
a fresh peer (the new SCM pod has no idea which reports we have already 
shipped), so its `isFullReportReadyToBeSent[type]` flags are seeded **fresh** 
rather than copied from the old key. Tested in 
`TestHeartbeatEndpointTaskDnsRefresh`.
   
   ### 4. `HeartbeatEndpointTask` trigger
   
   In the heartbeat catch block, after `logIfNeeded(ex)`:
   
   ```java
   if (resolveOnFailureEnabled                    // 
ozone.client.failover.resolve-needed
       && missedCount >= refreshThreshold         // 
ozone.datanode.scm.heartbeat.address.refresh.threshold
       && ConnectionFailureUtils.isConnectionFailure(ex)
       && hostAndPort != null) {
     maybeRefreshScmAddress();                    // calls 
SCMConnectionManager.refreshSCMServer
   }
   ```
   
   All four gates are required. Application-level errors don't trigger refresh. 
Endpoints without a preserved hostname (legacy code path) don't trigger. The 
threshold prevents over-reaction to a one-off blip.
   
   ### 5. New config knob
   
   `ozone.datanode.scm.heartbeat.address.refresh.threshold` (default **3**). 
Conservative default — at the typical 30-second heartbeat interval and 6-second 
`socketTimeout`, this means at most ~108 seconds of dialing the stale IP before 
the first DNS retry. In practice the failures are usually fast (TCP RST or 
routing failure), so the recovery is much faster.
   
   ## Real-world failure shapes this fix targets
   
   Two distinct failure modes drove the requirement:
   
   - **AWS EC2 / EKS — silent packet drop.** When a DN attempts to connect to 
the cached IP of `scm-0` after the pod has moved, AWS silently drops the 
packet. The TCP retry loop expires after `socketTimeout` (default 6 seconds in 
Ozone). Without this PR, the DN retries the same dead IP forever. With this PR, 
after `threshold` consecutive `SocketTimeoutException`s, the DN re-resolves DNS 
and swaps to the new IP.
   - **OpenStack — TCP RST or ICMP unreachable.** The network stack 
fast-rejects packets to the dead IP, surfacing as `ConnectException`. Same 
recovery path: after `threshold` consecutive failures, refresh.
   
   ## How was this patch tested?
   
   | Test class | Count | Coverage |
   |---|---|---|
   | `TestSCMConnectionManager` (extended) | 7 (1 prior + 6 new) | 
`resolveLatestAddress` edge cases. `refreshSCMServer` happy-path swap. No-op 
when `hostAndPort` not preserved. **Rollback regression**: when 
`buildScmEndpoint` throws, the stale endpoint remains registered (uses 
`@VisibleForTesting` overridable hook to inject the failure). |
   | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | Production trigger 
chain. `HeartbeatEndpointTask.call()` catch block fires `refreshSCMServer` only 
when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) 
`hostAndPort` preserved. `AccessControlException` at threshold does NOT 
trigger. After a successful swap, `StateContext`'s incremental-reports map has 
the new key and not the old key. |
   | `TestSCMConnectionManagerDnsRefreshE2E` (new) | 1 (`@Timeout(30)`) | 
Real-RPC swap mechanism. Stands up a real `ScmTestMock` RPC server on a 
loopback OS-assigned port, primes the connection manager with a stale 
`127.0.0.99` cache + preserved `localhost:port`, calls `refreshSCMServer`, 
asserts a real `sendHeartbeat` round-trips through the swapped endpoint. Lives 
in `hadoop-hdds/server-scm` because it depends on `ScmTestMock`. |
   
   Existing regression suite verified non-regressed: `TestEndPoint` (17), 
`TestHeartbeatEndpointTask` (8).
   
   ## Scope and known limitations
   
   - **DN initial bringup with stale DNS**: the refresh fires from the 
`HEARTBEAT` phase via `HeartbeatEndpointTask`. If a DN starts up with the SCM 
peer already at a stale IP and never reaches `HEARTBEAT`, the recovery path 
does not engage. Initial-bringup DNS staleness is the existing concern of 
[HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919)'s 
`ozone.network.jvm.address.cache.enabled=false`. `InitDatanodeState.java` 
already postpones initialization on initial-resolution failure.
   - **HDFS-14118-style construction-time DNS fan-out** (one hostname → 
multiple persistent IPs, for round-robin DNS HA) is a different problem and out 
of scope. Worth a follow-on JIRA if needed.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-15514
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] HDDS-15514. DNS refresh on heartbeat failure for DN to SCM [ozone]

Reply via email to