Ritesh Shukla created HDDS-15514:
------------------------------------
Summary: Datanode and OzoneManager fail to recover from SCM peer
IP changes; cache stale InetSocketAddress for process lifetime
Key: HDDS-15514
URL: https://issues.apache.org/jira/browse/HDDS-15514
Project: Apache Ozone
Issue Type: Bug
Components: SCM, OM, Ozone Client, Ozone Datanode
Affects Versions: 2.1.0
Reporter: Ritesh Shukla
Assignee: Ritesh Shukla
h1. Problem
In Kubernetes (and any environment where peer pod IPs may change while DNS
names remain stable), Apache Ozone DataNodes and OzoneManagers can become
permanently disconnected from SCM after an SCM peer pod is rescheduled to a new
IP. The DataNode/OM process remains alive but its heartbeats and RPC calls keep
dialing the now-defunct IP forever. The only known recoveries today are (a)
restart the DataNode/OM process or (b) deploy an external operator that watches
SCM pod IPs and force-restarts dependent components.
This is the same class of bug that {{HADOOP-17068}} fixed for HDFS NameNode HA
in Hadoop 3.4.0. The intent has been encoded in Hadoop common's
{{SecurityUtil.getByName}} javadoc since 2.x:
{quote}
4) ... if the host is re-resolved, ex. during a connection re-attempt, that a
reverse lookup to host and forward lookup to IP is not performed since the
reverse/forward mappings may not always return the same IP.
{quote}
But across Ozone's five inter-component RPC paths, that property does not hold
today.
h1. Failure modes observed in production
The bug presents differently depending on what the network does with packets to
the stale IP:
h3. AWS EC2 / EKS (silent packet drop) — long stalls, no recovery
When the cached IP belonged to an EC2 ENI that has since been released, the AWS
VPC silently drops packets destined for that IP. The DataNode's TCP SYN sits in
the kernel waiting for a SYN-ACK that never comes. Each connection attempt
consumes the full {{ipc.client.connect.timeout}} (default 20s) plus
{{ipc.client.connect.retry.count}} retries. The {{Client.updateAddress()}}
re-resolution in HADOOP-17068 is gated on {{IOException}} from
{{setupConnection}} — but on a silent drop, the exception only fires after the
full timeout chain. Across 3 SCMs in HA round-robin, hours can pass with the
DataNode process alive but completely decoupled from the cluster. Not a single
fresh DNS query is made because the existing {{InetSocketAddress}} in
{{EndpointStateMachine}} is never reconstructed.
h3. OpenStack / on-prem with TCP RST or ICMP Unreachable — fast crash loop
When the network actively rejects packets, the DataNode cycles through all
tiseconds. Hitting the maximum failover retry limit across all proxies
sequentially can cause the DataNode's heartbeat service to throw a fatal exception.
On restart, the DataNode builds a fresh {{SCMConnectionManager}} which forces a
fresh DNS resolution. So in this configuration, the bug self-heals at the cost
of unscheduled DataNode crashes.
In both cases, the underlying defect is the same: long-lived
{{InetSocketAddress}} instances frozen at process startup are never rebuilt.
h1. Root cause
{{InetSocketAddress(host, port)}} performs a one-shot DNS lookup at construction
and keeps the resolved IP for the object's lifetime. Apache Ozone's failover proxy
providers and DataNode connection manager all construct {{InetSocketAddress}}
objects once and hold them indefinitely as map keys, RPC proxy targets, and final
fields:
|| Path || Owner of the frozen address || Where it's built ||
| DN → SCM heartbeat | {{EndpointStateMachine.address}} (final) |
{{InitDatanodeState.java:104}} → {{SCMConnectionManager.addSCMServer}} (lines
133-174) |
| OM → SCM (block + container) | {{SCMProxyInfo.rpcAddr}} (final) |
{{SCMFailoverProxyProviderBase.loadConfigs}} (line 148) |
| Client → OM (Hadoop RPC) | {{OMProxyInfo.rpcAddr}} (final) |
{{OMFailoverPadoopRpcOMFailoverProxyProvider.initOmProxiesFromConfigs}} |
| OM ↔ OM control plane | (uses {{OMFailoverProxyProvider}} machinery — same
shape as Client → OM) | — |
| OM ↔ OM Ratis replication | {{RaftPeer.address}} (final String) — built
fr.getInetAddress(), ratisPort)}} |
{{OzoneManagerRatisServer.createRaftPeer}} (lines 438-451, 459-474) |
| SCM ↔ SCM Ratis replication | {{RaftPeer.address}} (final String) — alreadost
paths | {{SCMRatisServerImpl.buildRaftGroup}} (lines 399-414) |
Notably, {{HDDS-5919}}'s {{ozone.network.jvm.address.cache.enabled=false}}
only affects the JVM's positive DNS cache TTL, which would help future
{{NetUtils.createSocketAddr}} calls. But the heartbeat/RPC path never makes
such a call, because the {{InetSocketAddress}} is final on each
{{EndpointStateMachine}}/{{SCMProxyInfo}}/{{OMProxyInfo}}.
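The freezing behavior can be reproduced with the JDK alone. The sketch below is illustrative only: {{FrozenAddressDemo}}, the hostname {{scm-0.example}}, the IPs, and the port are made up, and {{InetAddress.getByAddress}} stands in for a DNS answer so that no real lookup happens. It shows that an {{InetSocketAddress}} built at "startup" keeps its original IP even after the name would resolve elsewhere, and that the old and new objects are not equal as map keys.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/** Illustrates that a resolved InetSocketAddress keeps its construction-time IP. */
final class FrozenAddressDemo {

  /** Builds an address from literal octets; getByAddress performs no DNS lookup. */
  static InetSocketAddress resolvedTo(String host, int a, int b, int c, int d, int port) {
    try {
      byte[] raw = {(byte) a, (byte) b, (byte) c, (byte) d};
      return new InetSocketAddress(InetAddress.getByAddress(host, raw), port);
    } catch (UnknownHostException e) {
      throw new IllegalStateException(e); // only thrown for an illegal byte length
    }
  }

  /** True when the cached object holds a different IP than a fresh lookup would. */
  static boolean isStale(InetSocketAddress cached, InetSocketAddress fresh) {
    return !cached.getAddress().equals(fresh.getAddress());
  }

  public static void main(String[] args) {
    // Hypothetical values: same hostname, IP changes after a pod reschedule.
    InetSocketAddress atStartup = resolvedTo("scm-0.example", 10, 0, 0, 5, 9861);
    InetSocketAddress afterMove = resolvedTo("scm-0.example", 10, 0, 0, 9, 9861);

    System.out.println("cached host: " + atStartup.getHostString());
    System.out.println("cached IP:   " + atStartup.getAddress().getHostAddress());
    System.out.println("stale:       " + isStale(atStartup, afterMove));
    // Same hostname, different IP: the objects also differ as map keys.
    System.out.println("equal keys:  " + atStartup.equals(afterMove));
  }
}
```

Nothing in the cached object changes on its own; only constructing a new {{InetSocketAddress}} from the hostname picks up the new IP.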
h1. Proposed fix
Mirror the HADOOP-17068 design pattern at the {{FailoverProxyProvider}} / {{in
Ozone (one tier above where Hadoop applied the fix, because Ozone's seams live
there). On each connection-class failure:
# Re-resolve the configured hostname via {{NetUtils.createSocketAddr(hostnam
# Compare the new resolved IP against the cached
{{InetSocketAddress.getAddress()}}.
# If changed: stop the cached RPC proxy, replace the cached address atomically, and let
the next retry build a fresh proxy via the existing creation path.
# If unchanged: fall through to existing retry behavior (no-op).
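The steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the patch itself: {{RefreshableAddress}} and {{refreshIfChanged}} are hypothetical names, and the resolver is injected as a {{Function}} (standing in for a {{NetUtils.createSocketAddr}} call) so the behavior can be exercised without DNS. Stopping the stale proxy and building a new one is left to the caller.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical holder that keeps the configured host next to the resolved address. */
final class RefreshableAddress {
  private final String host;
  private final int port;
  private final Function<String, InetAddress> resolver; // injected lookup (assumption)
  private final AtomicReference<InetSocketAddress> current;

  RefreshableAddress(String host, int port, Function<String, InetAddress> resolver) {
    this.host = host;
    this.port = port;
    this.resolver = resolver;
    this.current = new AtomicReference<>(new InetSocketAddress(resolver.apply(host), port));
  }

  InetSocketAddress get() {
    return current.get();
  }

  /**
   * Re-resolves the host and swaps the cached address if the IP changed.
   * Returns true on a swap; the caller would then stop the stale proxy.
   */
  boolean refreshIfChanged() {
    InetSocketAddress cached = current.get();
    InetAddress fresh = resolver.apply(host);                  // step 1: re-resolve
    if (fresh == null || fresh.equals(cached.getAddress())) {  // steps 2 and 4: compare, no-op
      return false;
    }
    // Step 3: atomic replace; loses the race harmlessly if another thread swapped first.
    return current.compareAndSet(cached, new InetSocketAddress(fresh, port));
  }

  /** Demo helper: an InetAddress from literal octets, with no DNS lookup. */
  static InetAddress ip(String host, int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      throw new IllegalStateException(e); // only thrown for an illegal byte length
    }
  }

  public static void main(String[] args) {
    InetAddress[] dns = {ip("scm-0.example", 10, 0, 0, 5)}; // fake DNS record
    RefreshableAddress addr = new RefreshableAddress("scm-0.example", 9861, h -> dns[0]);
    System.out.println("before move: swapped=" + addr.refreshIfChanged());
    dns[0] = ip("scm-0.example", 10, 0, 0, 9);               // pod rescheduled
    System.out.println("after move:  swapped=" + addr.refreshIfChanged());
    System.out.println("now dialing: " + addr.get().getAddress().getHostAddress());
  }
}
```

A compare-and-set is used here so that two threads hitting the same failure do not both tear down the proxy; whether the real change needs that depends on how the callers are synchronized.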
This requires preserving the original "host:port" config string alongside the
{{InetSocketAddress}} (currently only the resolved address is kept). The implementation adds an
opt-in config flag mirroring HBase's {{hbase.resolve.hostnames.on.failure}}
nt.failover.resolve-needed}}:
{noformat}
ozone.client.failover.resolve-needed = false (default)
ozone.datanode.scm.heartbeat.address.refresh.threshold = 3 (default; DN-sp
{noformat}
When the flag is false, behavior is byte-identical to current master. Operators of
existing non-K8s deployments see zero behavior change. This matches the HBase /
HADOOP-17068 precedent of requiring explicit operator opt-in for the
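As a rough sketch of how the two settings could gate the DataNode refresh, assuming the flag and the missed-heartbeat threshold are read once into a small policy object ({{AddressRefreshPolicy}} and {{shouldRefresh}} are hypothetical names, not part of the patch):

```java
/** Hypothetical policy object; the fields stand in for the two config keys. */
final class AddressRefreshPolicy {
  private final boolean resolveNeeded;  // ozone.client.failover.resolve-needed
  private final int missedThreshold;    // ozone.datanode.scm.heartbeat.address.refresh.threshold

  AddressRefreshPolicy(boolean resolveNeeded, int missedThreshold) {
    if (missedThreshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.resolveNeeded = resolveNeeded;
    this.missedThreshold = missedThreshold;
  }

  /** Refresh only when opted in and enough consecutive heartbeats were missed. */
  boolean shouldRefresh(int missedCount) {
    return resolveNeeded && missedCount >= missedThreshold;
  }

  public static void main(String[] args) {
    AddressRefreshPolicy defaults = new AddressRefreshPolicy(false, 3);
    AddressRefreshPolicy optedIn = new AddressRefreshPolicy(true, 3);
    System.out.println("defaults, 10 misses: " + defaults.shouldRefresh(10));
    System.out.println("opted in, 2 misses:  " + optedIn.shouldRefresh(2));
    System.out.println("opted in, 3 misses:  " + optedIn.shouldRefresh(3));
  }
}
```

With the defaults the method never returns true, which is the "no behavior change" property described above.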
h2. Per-path summary
|| Path || Mechanism ||
| DN → SCM | {{EndpointStateMachine}} preserves {{hostAndPort}} string.
{{HeartbeatEndpointTask}} catch block calls {{maybeRefreshScmAddress}} when
{{missedCount}} >= threshold. {{SCMConnectionManager.refreshSCMServer}} swaps
the endpoint atomically; theVERSION}} state, which is the correct behavior
because a peer that has been rescheduled is effectively a fresh process. |
| OM → SCM | {{SCMProxyInfo}} retains the config-time host:port.
{{SCMFailovoxyAddressIfChanged(nodeId)}} runs in {{shouldRetry}} when an
{{IOException}} chain contains {{ConnectException}},
{{NoRouteToHostException}}, or {{UnknownHostException}}. Stale proxy is stopped
via {{RPC.stopProxy}}. |
| Client → OM (Hadoop RPC) | {{OMProxyInfo.rpcAddr}} becomes mutable behind
thAddressIfChanged()}} re-resolves {{rpcAddrStr}}, swaps {{rpcAddr}} and
the derived {{dtService}}, nulls the cached proxy so the next
{{createProxyIfNeeded}} dials the new IP.
{{OMFailoverProxyProviderBase.shouldRetry}} calls this on connection-class
exceptions before advancing the failover index. |
| Client → OM (gRPC) | No code change required.
{{GrpcOMFailoverProxyProvideetSocketAddress(0)}} and lets gRPC's
{{NameResolver}} re-resolve hostnames on
its own schedule. |
| OM ↔ OM control plane | Uses Hadoop RPC via {{OMInterServiceProtocol}}, noy
via the Client → OM fix. |
| OM ↔ OM Ratis replication | {{OzoneManagerRatisServer.createRaftPeer}}
simname:port string to {{RaftPeer.setAddress}} — never a resolved IP. Two of
three previous {{createRaftPeer}} branches were calling {{new
InetSocketAddrratisPort)}}, which strips the hostname and freezes the IP. With
hostname-only addresses, gRPC's default {{DnsNameResolver}} (used by Ratis u
connection failure / on its own refresh schedule. No Ratis upstream change
required. |
| SCM ↔ SCM Ratis replication | Already uses hostname strings;
removes a mis use IP instead of hostname??}} comment in
{{SCMRatisServerImpl.buildRaftGroup}} and {{SCMHAManagerImpl}} and replaces
h1. Connection-class exception filter
The refresh path is gated on exception types where DNS re-resolution could p
* {{java.net.ConnectException}} — connection refused / unreachable
* {{java.net.NoRouteToHostException}} — host route gone
* {{java.net.UnknownHostException}} — DNS lookup failed downstream
Filtering excludes application-level errors ({{OMNotLeaderException}},
{{Ret{{AccessControlException}}) where SCM/OM is reachable on the cached IP and
the failure is logical, not network. This avoids triggering DNS load on ever
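A filter of this kind can be sketched as a walk over the exception's cause chain. The class and method names below are hypothetical, and only the three JDK exception types listed above are checked; the actual patch may structure this differently.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

/** Hypothetical helper that classifies failures worth a DNS re-resolution. */
final class ConnectionFailureFilter {
  private static final int MAX_DEPTH = 32; // guards against cyclic cause chains

  /** True if any exception in the cause chain is one of the three network types. */
  static boolean isConnectionClass(Throwable t) {
    for (int depth = 0; t != null && depth < MAX_DEPTH; depth++, t = t.getCause()) {
      if (t instanceof ConnectException
          || t instanceof NoRouteToHostException
          || t instanceof UnknownHostException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Throwable network = new IOException("rpc failed", new ConnectException("refused"));
    Throwable logical = new IOException("peer reachable, request rejected");
    System.out.println("network failure: " + isConnectionClass(network));
    System.out.println("logical failure: " + isConnectionClass(logical));
  }
}
```

Application-level errors never match because none of them extend the three listed types, so they fall through to the existing retry behavior without a DNS query.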
h1. Testing
13 new unit tests + 1 real-RPC integration test, all passing under {{mvn clean
test}} on the latest master:
|| Test class || Tests || What it covers ||
| {{TestSCMConnectionManager}} | +5 new | {{resolveLatestAddress}} edge cases,
{{refreshSCMServer}} happy-path swap, no-op when IP unchanged, no-op when
host:port not preserved (legacy ctor path) |
| {{TestSCMFailoverProxyProviderRefresh}} | 3 new | Swap on IP change,
no-op{hostAndPort}} not preserved |
| {{TestOMProxyInfoDnsRefresh}} | 3 new | Address swap, dtService update, proxy
null-out, proxy rebuild after refresh |
| {{TestSCMConnectionManagerDnsRefreshE2E}} | 1 new | Real Hadoop RPC server
(via {{ScmTestMock}}) on a real loopback socket. Connection manager primed with
a deliberately stale {{127.0.0.99}} and preserved hostname {{localhost:port}}.
{{refreshSCMServer}} fires; a real {{sendHeartbeat}} round-trips to the live
server; {{ScmTestMock.rpcCount}} increments. Proves the full chain: address
swap → fresh RPC proxy → real socket dial → s. |
| {{TestOzoneManagerRatisServer}} | +1 new | Asserts {{RaftPeer.getAddress()}}
is a hostname:port string, never an IP:port string. Defensive regex check that
the host portion is not a numeric IPv4. |
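For illustration, a defensive check along the lines of the last row could look like the following. This is a sketch with hypothetical names ({{RaftPeerAddressCheck}}, {{isHostnamePort}}); it only rejects dotted-quad IPv4 hosts and does not attempt to handle IPv6 literals.

```java
import java.util.regex.Pattern;

/** Hypothetical check that a peer address is hostname:port, not IPv4:port. */
final class RaftPeerAddressCheck {
  private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");
  private static final Pattern PORT = Pattern.compile("\\d{1,5}");

  static boolean isHostnamePort(String address) {
    int colon = (address == null) ? -1 : address.lastIndexOf(':');
    if (colon <= 0 || colon == address.length() - 1) {
      return false; // missing host, missing port, or no separator
    }
    String host = address.substring(0, colon);
    String port = address.substring(colon + 1);
    return PORT.matcher(port).matches() && !IPV4.matcher(host).matches();
  }

  public static void main(String[] args) {
    // Example values are made up for the demo.
    System.out.println(isHostnamePort("om1.example.svc:9872"));
    System.out.println(isHostnamePort("10.0.0.5:9872"));
  }
}
```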
Existing regression tests covered by the same module set
({{TestSCMConnectionManager}} 1 prior, {{TestEndPoint}} 17,
{{TestOMFailoverProxyProvider}} 8, {{TestOMFailovers}} 1,
{{TestOzoneManagerRatisServer}} 5 prior): all green.
A docker-compose validation run with the {{ozone-ha}} stack confirmed the new
code is wired into the runtime JAR ({{javap}} verified
{{addSCMServer(InetSocketAddress, String, String)}} and
{{refreshSCMServer(InetSocketAddress, String)}} are present in
{{hdds-container-service-2.2.0-SNAPSHOT.jar}}), the config flag reaches the
DN's process environment, and the cluster boots and processes writes
successfully under the opt-in flag.
h1. Scope and known limitations
* The fix only fires from the {{HEARTBEAT}} phase via
{{HeartbeatEndpointTask}}. If a DataNode starts up with the SCM peer already at
a stale IP (DN never reaches {{HEARTBEAT}}), the recovery path does not engage.
Initial-bringup DNS staleness is the
existing{ozone.network.jvm.address.cache.enabled=false}}.
{{InitDatanodeState.java:94-101}} already postpones initialization on
initial-resolution failure.
* HDFS-14118-style construction-time DNS fan-out (one hostname → multiple
persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out of
scope here. Worth a follow-on JIRA if Ozone deployments need it.
* The Ratis quorum-loss exit-0 issue ({{SCMStateMachine.close()}} calling {{
when leader election fails to converge, leading to Kubernetes CrashLoopBackOff
death spirals) is a separate concern. File as {{HDDS-XXXXX}} to exit non-zero so K8s'
standard restart handling becomes correct again.
h1. Suggested sub-task breakdown
# {{HDDS-XXXX1}}: DN → SCM heartbeat —
{{EndpointStateMachine}}/{{SCMConnectionManager}}/{{HeartbeatEndpointTask}}
(smallest blast radius; lands first to prove the pattern)
# {{HDDS-XXXX2}}: OM → SCM — {{SCMFailoverProxyProviderBase}}/{{SCMProxyInfo}}
# {{HDDS-XXXX3}}: Client → OM — {{OMFailoverProxyProviderBase}}/{{OMProxyInfo}}
# {{HDDS-XXXX4}}: Ratis hostnames-only —
{{OzoneManagerRatisServer.createRaftPeer}}, {{SCMRatisServerImpl}},
{{SCMHAManagerImpl}}, {{NodeDetails}} (smallest, lowest-risk; could land first to
reduce surface)
# {{HDDS-XXXX5}}: New config keys + {{ozone-default.xml}} entries (could fold
into XXXX1)
Each sub-task is independently testable and revertable.
Umbrella ticket citeeview context.
h1. References
* {{HADOOP-17068}}: client fails forever when namenode ipaddr changed (Hadoop
3.4.0). Commit {{fa14e4bc001e28d9912e8d985d09bab75aedb87c}}. Authors: Sean
Chow, He Xiaoqiao. Touches {{Client.setupConnection}} only.
* {{HDFS-14118}}: introduces {{dfs.client.failover.resolve-needed}} and the
{{AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary}} hook
(different shape: construction-time fan-out, not per-failure
refresh).
* {{HBASE}}: {{hbase.resolve.hostnames.on.failure}}
({{ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY}}) — same opt-in
design philosophy.
* {{ZOOKEEPER-1506}},
{{ZOOKEEPER-2982}}: ZooKeeper {{StaticHostProvider}} —}} call. The cleanest
design in the Hadoop-adjacent ecosystem; closer to "always on" than opt-in.
* {{HDDS-5919}}: introduces {{ozone.network.jvm.address.cache.enabled}} (defe
JVM-level positive DNS cache TTL but does not fix the
long-lived {{InetSocketAddress}} instances.