devjue opened a new issue, #16344: URL: https://github.com/apache/dubbo/issues/16344
### Pre-check - [x] I am sure that all the content I provide is in English. ### Search before asking - [x] I had searched in the [issues](https://github.com/apache/dubbo/issues?q=is%3Aissue) and found no similar issues. ### Apache Dubbo Component Java SDK (apache/dubbo) ### Dubbo Version Dubbo 3.3 (also reproducible on 3.2.x), JDK 8/17, Linux ### Steps to reproduce this issue 1. A Dubbo Triple Consumer talks to any HTTP/2 server (gateway / reverse proxy / direct Provider) that periodically emits `GOAWAY(errorCode=0, lastStreamId=MAX_INT)`. Common triggers: `max_requests_per_connection`, `idle_timeout`, `max_connection_duration`, hot-restart drain, overload manager. 2. Send sustained traffic (e.g. 1 QPS) over the connection. 3. Every GOAWAY produces exactly one failed request, then traffic recovers within ~1s. Captured call chain on the Consumer side: ``` SslHandler.decode -> DefaultHttp2FrameReader.readGoAwayFrame -> Http2FrameCodec.onHttp2Frame -> Http2MultiplexHandler -> TripleGoAwayHandler.channelRead(DefaultHttp2GoAwayFrame) -> NettyConnectionHandler.onGoAway -> AbstractNettyConnectionClient.onGoaway(channel) // sets channel ref to null ``` ### What you expected to happen A graceful GOAWAY (`errorCode=0`, `lastStreamId=MAX_INT`) means the server still accepts new streams on the old channel, so it should be transparent to callers — no request failure window. Actual behaviour: `AbstractNettyConnectionClient#onGoaway` immediately nulls the channel reference, then `NettyConnectionHandler#onGoAway` schedules a reconnect. In the gap, the application logs: ``` WARN SingleRouterChain - filter result is empty WARN AbstractDirectory - validInvokers: 0, invokersToReconnect: 1 ERROR ... - No provider available for service ... ``` `AbstractClusterInvoker#checkInvokers` throws `RpcException` **before** entering FailoverCluster's retry loop, so configuring `retries=N` does **not** mitigate the issue. Root causes: 1. The channel is nulled immediately instead of being kept alive during a graceful migration; with `lastStreamId=MAX_INT` the old channel could still serve. 2. `NettyConnectionClient#initBootstrap`'s `closeFuture` listener calls `clearNettyChannel()` unconditionally — when a new channel is later swapped in and the old one closes, the listener wipes the new reference (~1.3s downtime observed in tests). 3. Scheduling `doConnect` on the Netty I/O `EventLoop` is unsafe: `awaitUninterruptibly(connectTimeout)` blocks the I/O thread, freezing every other channel sharing that EventLoop during TLS handshake. ### Anything else Happens **every time** a GOAWAY is received — exactly one failed request per GOAWAY. Affects any Triple Consumer behind any HTTP/2 server that emits graceful GOAWAY. Because Dubbo connections are typically shared across many `@DubboReference` instances, a single GOAWAY can fan out to multiple downstream calls. I have a working patch on top of the `3.3` branch (5 files, +178 / -18) implementing graceful migration: 1. `NettyConnectionHandler#onGoAway`: keep old channel alive; schedule new connect on `connectivityExecutor` after `GRACEFUL_RECONNECT_DELAY_MS = 200ms`; one retry on failure, then fall back to `AbstractNettyConnectionClient#onGoaway`. 2. `NettyConnectionClient#initBootstrap`: change `closeFuture` listener from `clearNettyChannel()` to `compareAndClearNettyChannel(ch)` so the old channel's close listener cannot wipe a freshly swapped-in new channel. 3. `AbstractNettyConnectionClient`: add `compareAndClearNettyChannel`, plus package-private `getConnectivityExecutor()` / `getReconnectDuration()` accessors. 4. `NettyChannel#getChannelIfPresent`: lookup-only API used by `channelInactive` to avoid allocating a transient `NettyChannel` purely for logging. 5. `TripleGoAwayHandler`: log `errorCode` and `lastStreamId` so the GOAWAY trigger is observable. Validated against a local Triple demo: 26 GOAWAYs, 100% migration success, zero `ERROR` / `No provider`. Channel rotation chain (first 18 migrations): ``` :45956 -> :45110 -> :53222 -> :58366 -> :56372 -> :55436 -> :55188 -> :52926 -> :50682 -> :48682 -> :47626 -> :44014 -> :43116 -> :45044 -> :48602 -> :51880 -> :50992 -> :51672 ``` Happy to submit a PR against `3.3` and backport to `3.2` if maintainers agree with the design. ### Do you have a (mini) reproduction demo? - [ ] Yes, I have a minimal reproduction demo to help resolve this issue more effectively! ### Are you willing to submit a pull request to fix on your own? - [x] Yes I am willing to submit a pull request on my own! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
