Ritesh Shukla created HDDS-15514:
------------------------------------

             Summary: Datanode and OzoneManager fail to recover from SCM peer 
IP changes; cache stale InetSocketAddress for process lifetime
                 Key: HDDS-15514
                 URL: https://issues.apache.org/jira/browse/HDDS-15514
             Project: Apache Ozone
          Issue Type: Bug
          Components: sum, OM, Ozone Client, Ozone Datanode
    Affects Versions: 2.1.0
            Reporter: Ritesh Shukla
            Assignee: Ritesh Shukla


h1. Problem

In Kubernetes (and any environment where peer pod IPs may change while DNS 
names remain stable), Apache Ozone DataNodes and OzoneManagers can become 
permanently disconnected from SCM after an SCM peer pod is rescheduled to a new 
IP. The DataNode/OM process remains alive but its heartbeats and RPC calls keep 
dialing the now-defunct IP forever. The only known recoveries today are (a) 
restart the DataNode/OM process or (b) deploy an external operator that watches 
SCM pod IPs and force-restarts dependent components.

This is the same class of bug that {{HADOOP-17068}} fixed for HDFS NameNode HA 
in Hadoop 3.4.0. The intent has been encoded in Hadoop common's 
{{SecurityUtil.getByName}} javadoc since 2.x:

{quote}
4) ... if the host is re-resolved, ex. during a connection re-attempt, that a 
reverse lookup to host and forward lookup to IP is not performed since the 
reverse/forward mappings may not always return the same IP. {quote}

But across Ozone's five inter-component RPC paths, that property does not hold 
today.

h1. Failure modes observed in production

The bug presents differently depending on what the network does with packets to 
the stale IP:

h3. AWS EC2 / EKS (silent packet drop) — long stalls, no recovery

When the cached IP belonged to an EC2 ENI that has since been released, the AWS 
VPC silently drops packets destined for that IP. The DataNode's TCP SYN sits in 
the kernel waiting for a SYN-ACK that never comes. Each connection attempt 
consumes the full {{ipc.client.connect.timeout}} (default 20s) plus 
{{ipc.client.connect.retry.count}} retries. The {{Client.updateAddress()}} 
re-resolution in HADOOP-17068 is gated on {{IOException}} from 
{{setupConnection}} — but on a silent drop, the exception only fires after the 
full timeout chain. Across 3 SCMs in HA round-robin, hours can pass with the 
DataNode process alive but completely decoupled from the cluster. Not a single 
fresh DNS query is made because the existing {{InetSocketAddress}} in 
{{EndpointStateMachine}} is never reconstr

h3. OpenStack / on-prem with TCP RST or ICMP Unreachable — fast crash loop

When the network actively rejects packets, the DataNode cycles through all 
tiseconds. Hitting the maximum failover retry limit across all proxies 
sequentially can cause the DataNode's heartbeat service to throw a fatal excn. 
On restart, the DataNode builds a fresh {{SCMConnectionManager}} which forces a 
fresh DNS resolution. So in this configuration, the bug self-heals  at the cost 
of unscheduled DataNode crashes.

In both cases, the underlying defect is the same: long-lived 
{{InetSocketAddrozen at process startup are never rebuilt.

h1. Root cause

{{InetSocketAddress(host, port)}} performs a one-shot DNS lookup at construcIP 
for the object's lifetime. Apache Ozone's failover proxy providers and DataNode 
connection manager all construct {{InetSocketAddress}} objects oncedefinitely 
as map keys, RPC proxy targets, and final fields:

|| Path || Owner of the frozen address || Where it's built ||
| DN → SCM heartbeat | {{EndpointStateMachine.address}} (final) | 
{{InitDatanodeState.java:104}} → {{SCMConnectionManager.addSCMServer}} (lines 
133-174) |
| OM → SCM (block + container) | {{SCMProxyInfo.rpcAddr}} (final) | 
{{SCMFailoverProxyProviderBase.loadConfigs}} (line 148) |
| Client → OM (Hadoop RPC) | {{OMProxyInfo.rpcAddr}} (final) | 
{{OMFailoverPadoopRpcOMFailoverProxyProvider.initOmProxiesFromConfigs}} |
| OM ↔ OM control plane | (uses {{OMFailoverProxyProvider}} machinery — same 
shape as Client → OM) | — |
| OM ↔ OM Ratis replication | {{RaftPeer.address}} (final String) — built 
fr.getInetAddress(), ratisPort)}} | 
{{OzoneManagerRatisServer.createRaftPeer}}(lines 438-451, 459-474) |
| SCM ↔ SCM Ratis replication | {{RaftPeer.address}} (final String) — alreadost 
paths | {{SCMRatisServerImpl.buildRaftGroup}} (lines 399-414) |

Notably, {{HDDS-5919}}'s {{ozone.network.jvm.address.cache.enabled=false}} ot 
only affects the JVM's positive DNS cache TTL, which would help future
{{NetUtils.createSocketAddr}} calls. But the heartbeat/RPC path never makes 
ress}} is final on each
{{EndpointStateMachine}}/{{SCMProxyInfo}}/{{OMProxyInfo}}.

h1. Proposed fix

Mirror the HADOOP-17068 design pattern at the {{FailoverProxyProvider}} / {{in 
Ozone (one tier above where Hadoop applied the fix, because Ozone's seams live 
there). On each connection-class failure:

# Re-resolve the configured hostname via {{NetUtils.createSocketAddr(hostnam
# Compare the new resolved IP against the cached 
{{InetSocketAddress.getAddress()}}.
# If changed: stop the cached RPC proxy, replace the cached address atomicalet 
the next retry build a fresh proxy via the existing creation path.
# If unchanged: fall through to existing retry behavior (no-op).

This requires preserving the original "host:port" config string alongside ths}} 
(currently only the resolved address is kept). The implementation adds an 
opt-in config flag mirroring HBase's {{hbase.resolve.hostnames.on.failure}} 
nt.failover.resolve-needed}}:

{noformat} ozone.client.failover.resolve-needed = false   (default) 
ozone.datanode.scm.heartbeat.address.refresh.threshold = 3   (default; DN-sp 
{noformat}

When the flag is false, behavior is byte-identical to current master. Operat 
existing non-K8s deployments see zero behavior change. This matches the HBase / 
HADOOP-17068 precedent of requiring explicit operator opt-in for the

h2. Per-path summary

|| Path || Mechanism ||
| DN → SCM | {{EndpointStateMachine}} preserves {{hostAndPort}} string. 
{{HeartbeatEndpointTask}} catch block calls {{maybeRefreshScmAddress}} when 
{{missedCount}} >= threshold. {{SCMConnectionManager.refreshSCMServer}} swaps 
the endpoint atomically; theVERSION}} state, which is the correct behavior 
because a peer that has beenrescheduled is effectively a fresh process. |
| OM → SCM | {{SCMProxyInfo}} retains the config-time host:port. 
{{SCMFailovoxyAddressIfChanged(nodeId)}} runs in {{shouldRetry}} when an 
{{IOException}} chain contains {{ConnectException}}, 
{{NoRouteToHostException}}, or {{UnknownHostException}}. Stale proxy is stopped 
via {{RPC.stopProxy}}. |
| Client → OM (HadoopRPC) | {{OMProxyInfo.rpcAddr}} becomes mutable behind 
thAddressIfChanged()}} re-resolves {{rpcAddrStr}}, swaps {{rpcAddr}} and 
thederived {{dtService}}, nulls the cached proxy so the next 
{{createProxyIfNeeded}} dials the new IP. 
{{OMFailoverProxyProviderBase.shouldRetry}} calls this on connection-class
exceptions before advancing the failover index. |
| Client → OM (gRPC) | No code change required. 
{{GrpcOMFailoverProxyProvideetSocketAddress(0)}} and lets gRPC's 
{{NameResolver}} re-resolve hostnames on
its own schedule. |
| OM ↔ OM control plane | Uses Hadoop RPC via {{OMInterServiceProtocol}}, noy 
via the Client → OM fix. |
| OM ↔ OM Ratis replication | {{OzoneManagerRatisServer.createRaftPeer}} 
simname:port string to {{RaftPeer.setAddress}} — never a resolved IP. Two of 
three previous {{createRaftPeer}} branches were calling {{new 
InetSocketAddrratisPort)}}, which strips the hostname and freezes the IP. With 
hostname-only addresses, gRPC's default {{DnsNameResolver}} (used by Ratis u 
connection failure / on its own refresh schedule. No Ratis upstream change 
required. | | SCM ↔ SCM Ratis replication | Already uses hostname strings; 
removes a mis use IP instead of hostname??}} comment in 
{{SCMRatisServerImpl.buildRaftGroup}} and {{SCMHAManagerImpl}} and replaces

h1. Connection-class exception filter

The refresh path is gated on exception types where DNS re-resolution could p

* {{java.net.ConnectException}} — connection refused / unreachable
* {{java.net.NoRouteToHostException}} — host route gone
* {{java.net.UnknownHostException}} — DNS lookup failed downstream

Filtering excludes application-level errors ({{OMNotLeaderException}}, 
{{Ret{{AccessControlException}}) where SCM/OM is reachable on the cached IP and 
the failure is logical, not network. This avoids triggering DNS load on ever

h1. Testing

13 new unit tests + 1 real-RPC integration test, all passing under {{mvn clean 
test}} on the latest master:

|| Test class || Tests || What it covers ||
| {{TestSCMConnectionManager}} | +5 new | {{resolveLatestAddress}} edge cases, 
{{refreshSCMServer}} happy-path swap, no-op when IP unchanged, no-op when 
host:port not preserved (legacy ctor path) |
| {{TestSCMFailoverProxyProviderRefresh}} | 3 new | Swap on IP change, 
no-op{hostAndPort}} not preserved |
| {{TestOMProxyInfoDnsRefresh}} | 3 new | Address swap, dtService update, proxy 
null-out, proxy rebuild after refresh |
| {{TestSCMConnectionManagerDnsRefreshE2E}} | 1 new | Real Hadoop RPC server 
(via {{ScmTestMock}}) on a real loopback socket. Connection manager primed with 
a deliberately stale {{127.0.0.99}} and preserved hostname {{localhost:port}}. 
{{refreshSCMServer}} fires; a real {{sendHeartbeat}} round-trips to the live 
server; {{ScmTestMock.rpcCount}} increments. Proves the full chain: address 
swap → fresh RPC proxy → real socket dial → s. |
| {{TestOzoneManagerRatisServer}} | +1 new | Asserts {{RaftPeer.getAddress()}} 
is a hostname:port string, never an IP:port string. Defensive regex check that 
the host portion is not a numeric IPv4. |

Existing regression tests covered by the same module set 
({{TestSCMConnectionManager}} 1 prior, {{TestEndPoint}} 17, 
{{TestOMFailoverProxyProvider}} 8, {{TestOMFailovers}} 1, 
{{TestOzoneManagerRatisServer}} 5 prior): all green.

A docker-compose validation run with the {{ozone-ha}} stack confirmed the new 
code is wired into the runtime JAR ({{javap}} verified 
{{addSCMServer(InetSocketAddress, String, String)}} and 
{{refreshSCMServer(InetSocketAddress, String)}} are present in 
{{hdds-container-service-2.2.0-SNAPSHOT.jar}}), the config flag reaches the 
DN's process environment, and the cluster boots and processes writes 
successfully under the opt-in flag.

h1. Scope and known limitations

* The fix only fires from the {{HEARTBEAT}} phase via 
{{HeartbeatEndpointTask}}. If a DataNode starts up with the SCM peer already at 
a stale IP (DN never reaches {{HEARTBEAT}}), the recovery path does not engage. 
Initial-bringup DNS staleness is the 
existing{ozone.network.jvm.address.cache.enabled=false}}.{{InitDatanodeState.java:94-101}}
 already postpones initialization on initial-resolution failure.

* HDFS-14118-style construction-time DNS fan-out (one hostname → multiple 
persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out of 
scope here. Worth a follow-on JIRA if Ozone deployments need it.

* The Ratis quorum-loss exit-0 issue ({{SCMStateMachine.close()}} calling {{ 
when leader election fails to converge, leading to Kubernetes CrashLoopBackOff 
death spirals) is a separate concern. File as {{HDDS-XXXXX}it non-zero so K8s' 
standard restart handling becomes correct again.

h1. Suggested sub-task breakdown

# {{HDDS-XXXX1}}: DN → SCM heartbeat — 
{{EndpointStateMachine}}/{{SCMConnectionManager}}/{{HeartbeatEndpointTask}} 
(smallest blast radius; lands first to prove the pattern)      # 
{{HDDS-XXXX2}}: OM → SCM — {{SCMFailoverProxyProviderBase}}/{{SCMProxyInfo
# {{HDDS-XXXX3}}: Client → OM — {{OMFailoverProxyProviderBase}}/{{OMProxyInfo}}
# {{HDDS-XXXX4}}: Ratis hostnames-only — 
{{OzoneManagerRatisServer.createRaftPeer}}, {{SCMRatisServerImpl}}, 
{{SCMHAManagerImpl}}, {{NodeDetails}} (smallest, lowest-risk; could lfirst to 
reduce surface)
# {{HDDS-XXXX5}}: New config keys + {{ozone-default.xml}} entries (could fold 
into XXXX1)
                                                                                
                                                                                
                  Each sub-task is independently testable and revertable. 
Umbrella ticket citeeview context.

h1. References
* {{HADOOP-17068}}: client fails forever when namenode ipaddr changed (Hadoop 
3.4.0). Commit {{fa14e4bc001e28d9912e8d985d09bab75aedb87c}}. Authors: Sean 
Chow, He Xiaoqiao. Touche{{Client.setupConnection}} only.
* {{HDFS-14118}}: introduces {{dfs.client.failover.resolve-needed}} and the 
{{AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary}} hook 
(different shape:            construction-time fan-out, not per-failure 
refresh).
* {{HBASE}}: {{hbase.resolve.hostnames.on.failure}} 
({{ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY}}) — same opt-in 
design philosophy.                                  * {{ZOOKEEPER-1506}}, 
{{ZOOKEEPER-2982}}: ZooKeeper {{StaticHostProvider}} —}} call. The cleanest 
design in the Hadoop-adjacent ecosystem; closer to"always on" than opt-in.
* {{HDDS-5919}}: introduces {{ozone.network.jvm.address.cache.enabled}} (defe 
JVM-level positive DNS cache TTL but does not fix the 
long-lived{{InetSocketAddress}} instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to