kerneltime opened a new pull request, #10488: URL: https://github.com/apache/ozone/pull/10488
## What changes were proposed in this pull request? This is **PR 4 of 4** splitting [HDDS-15514](https://issues.apache.org/jira/browse/HDDS-15514) (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback). This PR fixes the **DN → SCM heartbeat path** — the largest and most invasive of the four split PRs. Unlike the failover-proxy-provider seams, the DN does not failover; it heartbeats every SCM in parallel via the `EndpointStateMachine` / `SCMConnectionManager` abstraction. The fix introduces: 1. An atomic `EndpointStateMachine` swap when DNS re-resolution detects an IP change. 2. Per-endpoint queue migration in `StateContext` so in-flight reports survive the swap. 3. A separate threshold knob (`ozone.datanode.scm.heartbeat.address.refresh.threshold`, default 3) — the heartbeat path runs at a much higher cadence than the failover-proxy path, so a count-based gate prevents over-reaction to transient blips. > **Stacked on [#10487](https://github.com/apache/ozone/pull/10487)** (PR-3 of 4 — OM → SCM DNS refresh). Reuses the `ConnectionFailureUtils` classifier and the `ozone.client.failover.resolve-needed` flag landed in [PR-2 (#10486)](https://github.com/apache/ozone/pull/10486). ## Why this matters `EndpointStateMachine.address` is the cached `InetSocketAddress` that the DN heartbeat task uses to dial each SCM peer. It is constructed at DN startup from the configured `host:port` and never re-resolved. When an SCM pod is rescheduled in Kubernetes, every heartbeat to that peer dials the now-defunct IP forever. The DN's `endpoints` set still contains the broken peer's `EndpointStateMachine`, but that machine never recovers without a DN process restart. This is the path the AWC ozone-operator's existing 7-layer workaround was built to defeat: after watching SCM pod IP changes, the operator force-restarts every DN. PR-4 is the upstream fix that lets the operator drop those restarts. ## What this PR does ### 1. `EndpointStateMachine` preserves a hostname | Change | Why | |---|---| | New `final String hostAndPort` field. | Source of truth for re-resolution. | | `resolveLatestAddress()`: re-resolves `hostAndPort` via `NetUtils.createSocketAddr` and returns the freshly-resolved `InetSocketAddress` only if its `getAddress()` differs from the cached one. Returns null on legacy endpoints (no preserved `hostAndPort`), unresolved DNS, or unchanged IP. | Lets the heartbeat task ask "did the IP just change?" without committing to a swap. | ### 2. `SCMConnectionManager.refreshSCMServer` — 4-phase atomic swap ``` PHASE A (read lock): snapshot the endpoint reference and hostAndPort PHASE B (no lock): resolveLatestAddress (DNS lookup must NEVER hold a lock) PHASE C (write lock): re-check snapshot, enforce collision invariant, build replacement endpoint, commit swap PHASE D (no lock): close stale endpoint (RPC.stopProxy + socket teardown) ``` Crucial properties (each had a corresponding bug in the original combined PR that Copilot's failure-injection lens caught): - **Build-then-swap, never remove-then-build.** If `buildScmEndpoint` throws (transient DNS, peer not yet accepting on the new IP, NetUtils refusing the address), the stale endpoint stays registered. Otherwise the peer would disappear from `scmMachines` entirely and no heartbeat could recover it. Tested by `TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure` using a `@VisibleForTesting` overridable `buildScmEndpoint` hook. - **Refuse swaps that collide with another registered peer key.** If transient kube-DNS returns peer-B's IP for peer-A's hostname, the swap is refused rather than overwriting peer-B's endpoint. Without this, peer-B's `EndpointStateMachine` would be silently replaced, leaking its executor and orphaning its task thread. - **Re-check after DNS lookup.** A concurrent `removeSCMServer` or refresh may have raced ahead while we were resolving. The write-lock phase verifies the snapshot is still current before swapping. - **`close()` outside the lock.** Stale-endpoint teardown blocks on `RPC.stopProxy`; holding `writeLock()` across that would stall every concurrent heartbeat / reconfiguration. ### 3. `StateContext.migrateEndpoint` — preserve in-flight reports across swap Per-endpoint queues (`incrementalReportsQueue`, `containerActions`, `pipelineActions`, `isFullReportReadyToBeSent`) are keyed by `InetSocketAddress`. Without migration, a swap would orphan all queued reports for that peer. The migration is ordered to preserve the invariant **"every endpoint in `endpoints` has a queue at every observable point"**: 1. **PUBLISH** — install new-key queues alongside the old-key queues. 2. **SWITCH** — add `newEndpoint` to the endpoints set; remove `oldEndpoint` from the endpoints set. 3. **RETIRE** — drop the old-key queues (no producer can reach them after step 2). `endpoints` is now a `CopyOnWriteArraySet` (was `HashSet`). `incrementalReportsQueue`, `containerActions`, `pipelineActions`, and `isFullReportReadyToBeSent` are now `ConcurrentHashMap` (some already were). Producers null-skip queue lookups as defense-in-depth — a producer racing migration MUST NOT NPE on a concurrent `remove`. The full-report flags get a special case: a swapped endpoint is effectively a fresh peer (the new SCM pod has no idea which reports we have already shipped), so its `isFullReportReadyToBeSent[type]` flags are seeded **fresh** rather than copied from the old key. Tested in `TestHeartbeatEndpointTaskDnsRefresh`. ### 4. `HeartbeatEndpointTask` trigger In the heartbeat catch block, after `logIfNeeded(ex)`: ```java if (resolveOnFailureEnabled // ozone.client.failover.resolve-needed && missedCount >= refreshThreshold // ozone.datanode.scm.heartbeat.address.refresh.threshold && ConnectionFailureUtils.isConnectionFailure(ex) && hostAndPort != null) { maybeRefreshScmAddress(); // calls SCMConnectionManager.refreshSCMServer } ``` All four gates are required. Application-level errors don't trigger refresh. Endpoints without a preserved hostname (legacy code path) don't trigger. The threshold prevents over-reaction to a one-off blip. ### 5. New config knob `ozone.datanode.scm.heartbeat.address.refresh.threshold` (default **3**). Conservative default — at the typical 30-second heartbeat interval and 6-second `socketTimeout`, this means at most ~108 seconds of dialing the stale IP before the first DNS retry. In practice the failures are usually fast (TCP RST or routing failure), so the recovery is much faster. ## Real-world failure shapes this fix targets Two distinct failure modes drove the requirement: - **AWS EC2 / EKS — silent packet drop.** When a DN attempts to connect to the cached IP of `scm-0` after the pod has moved, AWS silently drops the packet. The TCP retry loop expires after `socketTimeout` (default 6 seconds in Ozone). Without this PR, the DN retries the same dead IP forever. With this PR, after `threshold` consecutive `SocketTimeoutException`s, the DN re-resolves DNS and swaps to the new IP. - **OpenStack — TCP RST or ICMP unreachable.** The network stack fast-rejects packets to the dead IP, surfacing as `ConnectException`. Same recovery path: after `threshold` consecutive failures, refresh. ## How was this patch tested? | Test class | Count | Coverage | |---|---|---| | `TestSCMConnectionManager` (extended) | 7 (1 prior + 6 new) | `resolveLatestAddress` edge cases. `refreshSCMServer` happy-path swap. No-op when `hostAndPort` not preserved. **Rollback regression**: when `buildScmEndpoint` throws, the stale endpoint remains registered (uses `@VisibleForTesting` overridable hook to inject the failure). | | `TestHeartbeatEndpointTaskDnsRefresh` (new) | 6 | Production trigger chain. `HeartbeatEndpointTask.call()` catch block fires `refreshSCMServer` only when (a) flag enabled, (b) threshold met, (c) cause is connection-class, (d) `hostAndPort` preserved. `AccessControlException` at threshold does NOT trigger. After a successful swap, `StateContext`'s incremental-reports map has the new key and not the old key. | | `TestSCMConnectionManagerDnsRefreshE2E` (new) | 1 (`@Timeout(30)`) | Real-RPC swap mechanism. Stands up a real `ScmTestMock` RPC server on a loopback OS-assigned port, primes the connection manager with a stale `127.0.0.99` cache + preserved `localhost:port`, calls `refreshSCMServer`, asserts a real `sendHeartbeat` round-trips through the swapped endpoint. Lives in `hadoop-hdds/server-scm` because it depends on `ScmTestMock`. | Existing regression suite verified non-regressed: `TestEndPoint` (17), `TestHeartbeatEndpointTask` (8). ## Scope and known limitations - **DN initial bringup with stale DNS**: the refresh fires from the `HEARTBEAT` phase via `HeartbeatEndpointTask`. If a DN starts up with the SCM peer already at a stale IP and never reaches `HEARTBEAT`, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of [HDDS-5919](https://issues.apache.org/jira/browse/HDDS-5919)'s `ozone.network.jvm.address.cache.enabled=false`. `InitDatanodeState.java` already postpones initialization on initial-resolution failure. - **HDFS-14118-style construction-time DNS fan-out** (one hostname → multiple persistent IPs, for round-robin DNS HA) is a different problem and out of scope. Worth a follow-on JIRA if needed. ## What is the link to the Apache JIRA? https://issues.apache.org/jira/browse/HDDS-15514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
