[jira] [Updated] (HBASE-30245) RPC connection pinned to stale IP after cross-pod IP reuse: NSRE storm persists indefinitely because pooled channel is reused without re-resolving DNS

samad (Jira) Mon, 22 Jun 2026 01:10:09 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-30245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


samad updated HBASE-30245:
--------------------------
    Description: 
Summary: Under a narrow but real Kubernetes condition — a RegionServer pod's IP 
is reassigned to a different live pod (sometimes a RegionServer of an entirely 
different HBase cluster) — the async client gets stuck issuing requests to that 
wrong-but-live server and receives a continuous stream of 
{{NotServingRegionException}} (NSRE) for the affected regions. The condition 
does not self-heal: only a client process restart fixes it.

We run HBase on Kubernetes, where each pod co-hosts both a RegionServer and a 
DataNode. Multiple independent HBase clusters share the same Kubernetes 
environment.

We observed a failure scenario during node maintenance where an HBase async 
client can become permanently stuck talking to the wrong RegionServer after 
Kubernetes pod IP reuse.

Consider the following example:
 * *Pod A* hosts *RegionServer A* belonging to {*}HBase Cluster A{*}.
 * *Pod B* hosts *RegionServer B* belonging to {*}HBase Cluster B{*}.
 * Both pods are running on the same Kubernetes node.

During a maintenance activity (node reboot, drain, upgrade, etc.), all pods on 
the node restart.

A possible sequence is:
 # Pod A goes down.
 # Kubernetes later reassigns Pod A's old IP address to Pod B.
 # The client already has an established TCP connection to Pod A's old IP.
 # Because the connection remains alive through the networking/service-mesh 
layer, the client does not see a transport failure.
 # Requests intended for RegionServer A are now delivered to RegionServer B, 
which belongs to a completely different HBase cluster and has never hosted the 
requested regions.
 # RegionServer B correctly responds with {{NotServingRegionException}} (NSRE).
 # The client keeps reusing the same underlying RPC connection and continuously 
receives the following:

{noformat}
2026-06-09T06:03:15.226Z, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: app_ns:my-table,rowprefix-0001,1761217611582.65e37b957f2a12c0c710d3866bece520. is not online on hbase-B-dn-4.hbase-B.k8s-namespace.svc.cluster.local,16020,1780977222833
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3552)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3530)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1486)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2972)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44994)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
{noformat}
 
Note: The table {{app_ns:my-table}} belongs to HBase cluster-A. The RegionServer 
named in the NSRE ({{hbase-B-dn-4}}) belongs to a completely different HBase 
cluster (cluster-B). This table has never existed on cluster-B. The client is 
reaching cluster-B's RegionServer only because cluster-B's pod acquired the IP 
address that previously belonged to cluster-A's pod ({{hbase-A-dn-7}}). After a 
Kubernetes node maintenance event, both pods restarted on the same node. 
Cluster-B's pod came up first and was assigned cluster-A's old pod IP. The 
client's existing TCP channel was pinned to that IP and never re-resolved DNS, 
so all RPCs intended for cluster-A's RegionServer are landing on cluster-B's 
RegionServer instead.
 # Requests continue hitting RegionServer B and receive NSREs indefinitely.

In production, we observed this condition persisting for approximately an hour 
and generating tens of thousands of NSREs. Recovery occurred only after 
restarting the HBase client process.

*Observed Behavior*
 * Continuous NSREs for the same regions.
 * The same RegionServer appears in all NSRE responses.
 * The responding RegionServer belongs to a different HBase cluster than the 
target table.
 * No transport errors or connection failures are observed.
 * Client restart immediately restores normal operation.

*Expected Behavior*

When the client receives repeated NSREs from a RegionServer that does not match 
the expected destination, it should eventually drop the existing connection and 
establish a fresh one, allowing DNS re-resolution and recovery without 
requiring a client restart.
h3. Problem Identified

The HBase async client's {{NettyRpcConnection}} resolves DNS exactly once — 
when the channel is first created — and never re-checks it for the lifetime of 
that channel. If the underlying IP changes (e.g., Kubernetes pod IP reuse), the 
channel remains pinned to the old (now wrong) IP indefinitely. An NSRE is an 
application-level response and does not trigger channel closure, so the client 
never gets a chance to re-resolve DNS.
h4. 1. DNS resolution happens only once per channel, inside {{connect()}}

While {{channel != null}}, the existing channel is reused indefinitely and 
DNS is never re-checked.
{code:java|title=NettyRpcConnection.java — sendRequest0()}
@Override
public void run(boolean cancelled) throws IOException {
    if (cancelled) {
        setCancelled(call);
    } else {
        if (channel == null) {       // ← ONLY path to DNS resolution
            connect();
        }
        scheduleTimeoutTask(call);
        NettyFutureUtils.addListener(channel.writeAndFlush(call), new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                if (!future.isSuccess()) {
                    call.setException(toIOE(future.cause()));
                }
            }
        });
    }
}
{code}
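As a JDK-level illustration (not HBase code) of why a connected channel stays tied to one IP: an {{InetSocketAddress}} built from a host name is resolved once, at construction, and keeps that address for its lifetime, whereas {{createUnresolved}} keeps only the name. The class name {{ResolveOnceDemo}} and the {{example.invalid}} host are hypothetical, and this is only a sketch of the general behaviour, not of how {{NettyRpcConnection}} builds its address.

```java
import java.net.InetSocketAddress;

/** Illustration only: resolution happens at construction, not on later use. */
final class ResolveOnceDemo {

    /** Resolves the host immediately; the returned object keeps that IP. */
    static InetSocketAddress resolveNow(String host, int port) {
        return new InetSocketAddress(host, port);
    }

    /** Keeps only the host name; no lookup is performed here. */
    static InetSocketAddress deferred(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        // A literal IP is used so the demo does not depend on real DNS.
        InetSocketAddress pinned = resolveNow("127.0.0.1", 16020);
        InetSocketAddress lazy = deferred("hbase-A-dn-7.example.invalid", 16020);
        System.out.println("pinned is resolved: " + !pinned.isUnresolved());
        System.out.println("lazy is resolved:   " + !lazy.isUnresolved());
    }
}
```

A socket connected through the resolved form never consults DNS again, which is the property the {{channel == null}} gate above relies on.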
h4. 2. NSRE does not close the channel

The {{channel = null}} reset lives only in {{shutdown0()}}, which is 
triggered by transport failures ({{channelInactive}}, {{exceptionCaught}}) or 
an explicit shutdown, never by an application-level NSRE response:
{code:java|title=NettyRpcConnection.java — shutdown0()}
private void shutdown0() {
    assert eventLoop.inEventLoop();
    if (channel != null) {
        NettyFutureUtils.consume(channel.close());
        channel = null;    // ← ONLY place channel becomes null
    }
}
{code}
{code:java|title=NettyRpcDuplexHandler.java — transport-level triggers only}
@Override
public void channelInactive(ChannelHandlerContext ctx) throws Exception {
    if (!id2Call.isEmpty()) {
        cleanupCalls(new ConnectionClosedException("Connection closed"));
    }
    conn.shutdown();    // ← called on TCP break (RST, FIN)
    ctx.fireChannelInactive();
}

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    if (!id2Call.isEmpty()) {
        cleanupCalls(IPCUtil.toIOE(cause));
    }
    conn.shutdown();    // ← called on transport error (broken pipe, etc.)
}
{code}
An NSRE is a normal application-level response on a healthy TCP socket. It 
triggers neither {{channelInactive}} nor {{exceptionCaught}}. Therefore 
{{shutdown0()}} is never called, {{channel}} stays non-null, and {{connect()}} 
(with fresh DNS) is never invoked again.
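The behaviour described in sections 1 and 2 can be reproduced with a toy model. This is a minimal sketch, not HBase code: {{PinnedConnectionModel}}, its two maps standing in for DNS and the network, and the IP addresses are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a pooled connection that consults DNS only when it has no channel. */
final class PinnedConnectionModel {
    /** Stand-in for DNS: host name to current IP. */
    final Map<String, String> dns = new HashMap<>();
    /** Stand-in for the network: IP to the server currently listening on it. */
    final Map<String, String> serverAtIp = new HashMap<>();

    private final String host;
    private String channelIp; // null means "no channel", like channel == null
    int dnsLookups = 0;

    PinnedConnectionModel(String host) {
        this.host = host;
    }

    /** Sends one request and returns the name of the server that answered. */
    String send() {
        if (channelIp == null) { // the only path that consults DNS
            channelIp = dns.get(host);
            dnsLookups++;
        }
        return serverAtIp.get(channelIp);
    }

    /** Models a transport failure or explicit shutdown, the only events that clear the channel. */
    void shutdown() {
        channelIp = null;
    }

    public static void main(String[] args) {
        PinnedConnectionModel conn = new PinnedConnectionModel("hbase-A-dn-7");
        conn.dns.put("hbase-A-dn-7", "10.0.0.5");
        conn.serverAtIp.put("10.0.0.5", "RegionServer A");
        System.out.println(conn.send());

        // Pod restart: A comes back on a new IP, B takes over the old one.
        conn.dns.put("hbase-A-dn-7", "10.0.0.9");
        conn.serverAtIp.put("10.0.0.9", "RegionServer A");
        conn.serverAtIp.put("10.0.0.5", "RegionServer B");
        System.out.println(conn.send()); // still sent to the old IP

        conn.shutdown();
        System.out.println(conn.send()); // re-resolved after the channel was cleared
    }
}
```

In the model, any number of {{send()}} calls after the IP swap keep reaching RegionServer B until something calls {{shutdown()}}, which mirrors why an application-level NSRE alone never triggers recovery.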
h3. Proposed Fixes — Requesting Guidance

We would appreciate committer guidance on which of the following approaches is 
preferred. We are happy to contribute the patch and tests for whichever 
direction the project prefers.
h4. Option 1: Peer-vs-DNS drift check at the channel-reuse gate

In {{NettyRpcConnection.sendRequest0()}}, add a throttled re-check: if 
{{channel.remoteAddress().getAddress()}} differs from 
{{InetAddress.getByName(host)}}, call {{shutdown0()}} so that the existing 
{{if (channel == null) connect();}} path re-resolves DNS. This is a 
connection-layer change only and needs no protocol change.
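A minimal, self-contained sketch of the throttled drift check follows. It is not the proposed patch: {{DnsDriftChecker}}, its {{Resolver}} interface, and the injected clock are hypothetical names introduced so the logic can be exercised without real DNS or HBase classes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.LongSupplier;

/** Hypothetical helper: throttled comparison of a channel's connected IP with current DNS. */
final class DnsDriftChecker {
    /** Resolver abstraction so the check can run against a fake DNS in tests. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final InetAddress connectedIp;
    private final long intervalMs;
    private final Resolver resolver;
    private final LongSupplier clock;
    private long lastCheckMs;

    DnsDriftChecker(String host, InetAddress connectedIp, long intervalMs,
                    Resolver resolver, LongSupplier clock) {
        this.host = host;
        this.connectedIp = connectedIp;
        this.intervalMs = intervalMs;
        this.resolver = resolver;
        this.clock = clock;
        this.lastCheckMs = clock.getAsLong(); // the connect time counts as a check
    }

    /** Returns true only when a check is due and DNS now points at a different IP. */
    boolean hasDrifted() {
        long now = clock.getAsLong();
        if (now - lastCheckMs < intervalMs) {
            return false; // throttled: at most one lookup per interval
        }
        lastCheckMs = now;
        try {
            return !resolver.resolve(host).equals(connectedIp);
        } catch (UnknownHostException e) {
            return false; // lookup failed: keep the existing channel
        }
    }
}
```

In production the resolver would presumably be {{InetAddress::getByName}} and the clock a millisecond time source. {{InetAddress.getByName}} can block, so a real patch would need to decide whether running it on the connection's event loop is acceptable.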
h4. Option 2: Responder-identity check

On any error carrying a server identity, compare the responder's {{ServerName}} 
against {{loc.getServerName()}}. If they mismatch, call 
{{AbstractRpcClient.cancelConnections(loc.getServerName())}} to force 
re-resolution on the next send.

Today the NSRE response only carries the responder identity inside the 
exception message text. {{ExceptionResponse}} reserves {{hostname}}/{{port}} 
fields for {{RegionMovedException}} only. Extending the wire format with a 
{{responder_server_name}} field would let the client do this structurally 
rather than parsing strings.

Hook point: {{AsyncRegionLocatorHelper.updateCachedLocationOnError}} already 
receives both {{loc}} and the cause.

*Pros:* Most precise detection — only triggers when the responder is 
definitively wrong. *Cons:* Requires protobuf/wire-format change.
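Until such a field exists, the only available signal is the message text. The sketch below shows what the string-parsing variant might look like. It assumes the {{is not online on host,port,startcode}} shape seen in the stack trace above. {{ResponderIdentity}} is a hypothetical helper, and the fully qualified name used for cluster-A's pod in the usage note is invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the responder from an NSRE message and compares it. */
final class ResponderIdentity {
    // Assumed message shape: "... is not online on <host>,<port>,<startcode>"
    private static final Pattern NOT_ONLINE_ON =
        Pattern.compile("is not online on\\s+([^,\\s]+),(\\d+),(\\d+)");

    /** Returns the responder's host name, or null if the message carries none. */
    static String responderHost(String nsreMessage) {
        Matcher m = NOT_ONLINE_ON.matcher(nsreMessage);
        return m.find() ? m.group(1) : null;
    }

    /**
     * True when the message names a responder whose host or port differs from
     * the expected destination. The start code is ignored on purpose, since it
     * changes on every restart of the same server.
     */
    static boolean isWrongResponder(String nsreMessage, String expectedHost, int expectedPort) {
        Matcher m = NOT_ONLINE_ON.matcher(nsreMessage);
        if (!m.find()) {
            return false; // no identity in the message: make no claim
        }
        boolean sameHost = m.group(1).equalsIgnoreCase(expectedHost);
        boolean samePort = m.group(2).equals(Integer.toString(expectedPort));
        return !(sameHost && samePort);
    }
}
```

For the NSRE shown earlier, {{isWrongResponder(message, "hbase-A-dn-7.hbase-A.k8s-namespace.svc.cluster.local", 16020)}} would return true, at which point the client could cancel the connection. The fragility of matching on message text is the reason a structured field is suggested.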
h3. Questions for Committers
 # Is Option 1 (DNS drift check, pure client-side fix) acceptable?
 # Would Option 2 (responder-identity with a proto field) be considered for a 
more structural fix?
 # Are there any prior JIRAs or discussions related to this that we should link 
to, or any other solutions we should consider?

  was:
Summary: Under a narrow but real Kubernetes condition — a RegionServer pod's IP 
is reassigned to a different live pod (sometimes a RegionServer of an entirely 
different HBase cluster) — the async client gets stuck issuing requests to that 
wrong-but-live server and receives a continuous stream of 
{{NotServingRegionException}} (NSRE) for the affected regions. The condition 
does not self-heal: only a client process restart fixes it.

We run HBase on Kubernetes, where each pod co-hosts both a RegionServer and a 
DataNode. Multiple independent HBase clusters share the same Kubernetes 
environment.

We observed a failure scenario during node maintenance where an HBase async 
client can become permanently stuck talking to the wrong RegionServer after 
Kubernetes pod IP reuse.

Consider the following example:
 * *Pod A* hosts *RegionServer A* belonging to {*}HBase Cluster A{*}.
 * *Pod B* hosts *RegionServer B* belonging to {*}HBase Cluster B{*}.
 * Both pods are running on the same Kubernetes node.

During a maintenance activity (node reboot, drain, upgrade, etc.), all pods on 
the node restart.

A possible sequence is:
 # Pod A goes down.
 # Kubernetes later reassigns Pod A's old IP address to Pod B.
 # The client already has an established TCP connection to Pod A's old IP.
 # Because the connection remains alive through the networking/service-mesh 
layer, the client does not see a transport failure.
 # Requests intended for RegionServer A are now delivered to RegionServer B, 
which belongs to a completely different HBase cluster and has never hosted the 
requested regions.
 # RegionServer B correctly responds with {{NotServingRegionException}} (NSRE).
 # The client keeps reusing the same underlying RPC connection and continuously 
receives the following:

{noformat}
2026-06-09T06:03:15.226Z, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: app_ns:my-table,rowprefix-0001,1761217611582.65e37b957f2a12c0c710d3866bece520. is not online on hbase-B-dn-4.hbase-B.k8s-namespace.svc.cluster.local,16020,1780977222833
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3552)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3530)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1486)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2972)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44994)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
{noformat}
 
Note: The table {{app_ns:my-table}} belongs to HBase cluster-A. The RegionServer 
named in the NSRE ({{hbase-B-dn-4}}) belongs to a completely different HBase 
cluster (cluster-B). This table has never existed on cluster-B. The client is 
reaching cluster-B's RegionServer only because cluster-B's pod acquired the IP 
address that previously belonged to cluster-A's pod ({{hbase-A-dn-7}}). After a 
Kubernetes node maintenance event, both pods restarted on the same node. 
Cluster-B's pod came up first and was assigned cluster-A's old pod IP. The 
client's existing TCP channel was pinned to that IP and never re-resolved DNS, 
so all RPCs intended for cluster-A's RegionServer are landing on cluster-B's 
RegionServer instead.
 # Requests continue hitting RegionServer B and receive NSREs indefinitely.

In production, we observed this condition persisting for approximately an hour 
and generating tens of thousands of NSREs. Recovery occurred only after 
restarting the HBase client process.

*Observed Behavior*
 * Continuous NSREs for the same regions.
 * The same RegionServer appears in all NSRE responses.
 * The responding RegionServer belongs to a different HBase cluster than the 
target table.
 * No transport errors or connection failures are observed.
 * Client restart immediately restores normal operation.

*Expected Behavior*

When the client receives repeated NSREs from a RegionServer that does not match 
the expected destination, it should eventually drop the existing connection and 
establish a fresh one, allowing DNS re-resolution and recovery without 
requiring a client restart.



h3. Problem Identified

The HBase async client's {{NettyRpcConnection}} resolves DNS exactly once, 
when the channel is first created, and never re-checks it for the lifetime of 
that channel. If the underlying IP changes (e.g., Kubernetes pod IP reuse), the 
channel remains pinned to the old (now wrong) IP indefinitely. An NSRE is an 
application-level response and does not trigger channel closure, so the client 
never gets a chance to re-resolve DNS.

h4. 1. DNS resolution happens only once per channel, inside {{connect()}}

While {{channel != null}}, the existing channel is reused indefinitely and DNS 
is never re-checked.

{code:java|title=NettyRpcConnection.java — sendRequest0()}
@Override
public void run(boolean cancelled) throws IOException {
    if (cancelled) {
        setCancelled(call);
    } else {
        if (channel == null) {       // ← ONLY path to DNS resolution
            connect();
        }
        scheduleTimeoutTask(call);
        NettyFutureUtils.addListener(channel.writeAndFlush(call), new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                if (!future.isSuccess()) {
                    call.setException(toIOE(future.cause()));
                }
            }
        });
    }
}
{code}

h4. 2. NSRE does not close the channel

The {{channel = null}} reset lives only in {{shutdown0()}}, which is 
triggered by transport failures ({{channelInactive}}, {{exceptionCaught}}) or 
an explicit shutdown, never by an application-level NSRE response:

{code:java|title=NettyRpcConnection.java — shutdown0()}
private void shutdown0() {
    assert eventLoop.inEventLoop();
    if (channel != null) {
        NettyFutureUtils.consume(channel.close());
        channel = null;    // ← ONLY place channel becomes null
    }
}
{code}

{code:java|title=NettyRpcDuplexHandler.java — transport-level triggers only}
@Override
public void channelInactive(ChannelHandlerContext ctx) throws Exception {
    if (!id2Call.isEmpty()) {
        cleanupCalls(new ConnectionClosedException("Connection closed"));
    }
    conn.shutdown();    // ← called on TCP break (RST, FIN)
    ctx.fireChannelInactive();
}

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    if (!id2Call.isEmpty()) {
        cleanupCalls(IPCUtil.toIOE(cause));
    }
    conn.shutdown();    // ← called on transport error (broken pipe, etc.)
}
{code}

An NSRE is a normal application-level response on a healthy TCP socket. It 
triggers neither {{channelInactive}} nor {{exceptionCaught}}. Therefore 
{{shutdown0()}} is never called, {{channel}} stays non-null, and 
{{connect()}} (with fresh DNS) is never invoked again.

h4. 3. The resulting loop

{noformat}
NSRE
  → region location cache evicted (correct)
  → meta re-read → returns same hostname (correct: region IS assigned to this RS)
  → connection pool lookup by hostname → returns same NettyRpcConnection
  → sendRequest0() → channel != null → skips connect() → skips DNS
  → writeAndFlush on channel pinned to wrong IP
  → reaches wrong RegionServer
  → NSRE
  → repeat forever
{noformat}

----

h3. Proposed Fixes — Requesting Guidance

We would appreciate committer guidance on which of the following approaches is 
preferred. We are happy to contribute the patch and tests for whichever 
direction the project prefers.

h4. Option 1: Peer-vs-DNS drift check at the channel-reuse gate

In {{NettyRpcConnection.sendRequest0()}}, periodically compare the channel's 
connected IP against current DNS. If they differ, call {{shutdown0()}} so the 
existing {{if (channel == null) connect();}} path re-resolves. Throttled to 
avoid excessive DNS lookups (e.g., once every 30 seconds).

{code:java|title=Proposed change — NettyRpcConnection.java}
// New fields
private InetAddress channelConnectedIp;
private long lastDnsCheckTime;

public static final String DNS_CHECK_INTERVAL_KEY =
    "hbase.client.dns.check.interval.ms";
public static final long DNS_CHECK_INTERVAL_DEFAULT = 30_000;

// In connect():
private void connect() throws UnknownHostException {
    InetSocketAddress remoteAddr = getRemoteInetAddress(rpcClient.metrics);
    this.channelConnectedIp = remoteAddr.getAddress();
    this.lastDnsCheckTime = EnvironmentEdgeManager.currentTime();
    this.channel = new Bootstrap()
        .group(eventLoop).channel(rpcClient.channelClass)
        .remoteAddress(remoteAddr).connect().channel();
}

// New method:
private boolean hasIpChanged() {
    long now = EnvironmentEdgeManager.currentTime();
    long interval = rpcClient.conf.getLong(
        DNS_CHECK_INTERVAL_KEY, DNS_CHECK_INTERVAL_DEFAULT);
    if (now - lastDnsCheckTime < interval) {
        return false;
    }
    lastDnsCheckTime = now;
    try {
        InetAddress currentIp = InetAddress.getByName(
            remoteId.getAddress().getHostName());
        return !currentIp.equals(channelConnectedIp);
    } catch (UnknownHostException e) {
        return false;
    }
}

// Modified sendRequest0():
if (channel == null || hasIpChanged()) {
    if (channel != null) {
        LOG.warn("DNS for {} changed from {}. Reconnecting.",
            remoteId.getAddress(), channelConnectedIp);
        shutdown0();
    }
    connect();
}
{code}

*Pros:* Connection-layer only. No protocol change. Single file change. Zero 
false positives — channel is only closed when the IP actually changed. Overhead 
is one DNS lookup every 30s per RS connection (~0.1ms).

h4. Option 2: Responder-identity check

On any error carrying a server identity, compare the responder's 
{{ServerName}} against {{loc.getServerName()}}. If they mismatch, call 
{{AbstractRpcClient.cancelConnections(loc.getServerName())}} to force 
re-resolution on the next send.

Today the NSRE response only carries the responder identity inside the 
exception message text. {{ExceptionResponse}} reserves {{hostname}}/{{port}} 
fields for {{RegionMovedException}} only. Extending the wire format with a 
{{responder_server_name}} field would let the client do this structurally 
rather than parsing strings.

Hook point: {{AsyncRegionLocatorHelper.updateCachedLocationOnError}} already 
receives both {{loc}} and the cause.

*Pros:* Most precise detection — only triggers when the responder is 
definitively wrong. *Cons:* Requires protobuf/wire-format change.

h4. Option 3: Consecutive NSRE counter per connection

Track consecutive NSREs per {{NettyRpcConnection}}. After N consecutive NSREs 
(e.g., 3), close the channel. Reset the counter on any successful response.

*Pros:* Simple, no protocol change. *Cons:* Slightly less precise than Option 1 
— relies on a threshold heuristic rather than direct IP comparison.
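
A minimal sketch of such a counter is shown below. {{ConsecutiveNsreTracker}} is a hypothetical class, not part of HBase, and it assumes all calls come from the connection's single event-loop thread, so it does no synchronization.

```java
/** Hypothetical per-connection counter of consecutive NSRE responses. */
final class ConsecutiveNsreTracker {
    private final int threshold;
    private int consecutive;

    ConsecutiveNsreTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one NSRE; returns true when the channel should be closed. */
    boolean onNsre() {
        consecutive++;
        if (consecutive >= threshold) {
            consecutive = 0; // start fresh for the replacement channel
            return true;
        }
        return false;
    }

    /** Any successful response proves the peer serves some region, so reset. */
    void onSuccess() {
        consecutive = 0;
    }
}
```

With a threshold of 3, the third NSRE in a row returns true and the caller would close the channel. A single success in between restarts the count, which keeps ordinary region moves from tripping it.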

----

h3. Questions for Committers

# Is Option 1 (DNS drift check, pure client-side fix) acceptable for 
{{master}} and {{branch-2.5}}?
# Would Option 2 (responder-identity with a proto field) be considered for a 
more structural fix?
# Are there any prior JIRAs or discussions related to this that we should link 
to?


> RPC connection pinned to stale IP after cross-pod IP reuse: NSRE storm 
> persists indefinitely because pooled channel is reused without re-resolving 
> DNS
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30245
>                 URL: https://issues.apache.org/jira/browse/HBASE-30245
>             Project: HBase
>          Issue Type: Bug
>         Environment: * HBase 2.5.12 client (present in other client versions 
> as well)
>  * HBase deployed on Kubernetes.
>  * RegionServer and DataNode are co-hosted in the same pod.
>  * Multiple HBase clusters run in the same Kubernetes environment.
>            Reporter: samad
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
