qianye1001 opened a new issue, #10510:
URL: https://github.com/apache/rocketmq/issues/10510
## Motivation
When a TCP connection breaks without a FIN or RST (e.g., an intermediate
network device silently drops packets, or the peer host crashes or loses
power), the gRPC client has to rely on HTTP/2 PING (keepalive) to detect the
dead connection. Until the keepalive mechanism detects the failure, **all
requests on that connection fail**, and each one fails only after its deadline
expires.
Currently, the gRPC server in the Proxy does **not** configure
`permitKeepAliveTime` or `permitKeepAliveWithoutCalls`, so the grpc-java Netty
server defaults apply:
- `permitKeepAliveTime` = **5 minutes**: clients cannot send keepalive pings
more often than once every 5 minutes; a client that keeps doing so is
disconnected with a GOAWAY
- `permitKeepAliveWithoutCalls` = **false**: keepalive pings on idle
connections (no active RPCs) are not permitted by the server
This causes two problems:
1. **Slow dead-connection detection**: The `rocketmq-clients` Java SDK sets
`keepAliveTime = 300s` (5 min), so worst-case detection time is **5 min + 30s =
5.5 minutes**. During this window, all sends to the affected endpoint fail.
2. **Idle connection keepalive ineffective**: Although `rocketmq-clients`
sets `keepAliveWithoutCalls(true)`, the server's default
`permitKeepAliveWithoutCalls = false` means these pings are not permitted, so
health checking of idle connections does not work as intended.
Additionally, the server itself does not configure `keepAliveTime` or
`keepAliveTimeout`, so it cannot proactively detect dead client connections
either.
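To make the interaction between the client interval and the server's permit settings concrete, here is a deliberately simplified model of server-side ping policing. It is a sketch, not grpc-java's implementation: the class name `PingPolicyModel`, the strike limit, and the 2-hour idle interval are assumptions chosen only for illustration.

```java
/**
 * Simplified, illustrative model of how a server might police client
 * keepalive pings. This is NOT grpc-java's implementation: the class name,
 * the strike limit, and the idle interval are assumptions for illustration.
 */
final class PingPolicyModel {
    private static final int MAX_STRIKES = 2;                                  // assumed tolerance
    private static final long IDLE_MIN_INTERVAL_MILLIS = 2 * 60 * 60 * 1000L;  // assumed: 2 hours

    private final long permitKeepAliveTimeMillis;
    private final boolean permitKeepAliveWithoutCalls;
    private boolean hasActiveCalls;
    private long lastAcceptedPingMillis;
    private int strikes;

    PingPolicyModel(long permitKeepAliveTimeMillis, boolean permitKeepAliveWithoutCalls) {
        this.permitKeepAliveTimeMillis = permitKeepAliveTimeMillis;
        this.permitKeepAliveWithoutCalls = permitKeepAliveWithoutCalls;
    }

    void setHasActiveCalls(boolean hasActiveCalls) {
        this.hasActiveCalls = hasActiveCalls;
    }

    /**
     * Records a client ping at the given time (millis since the connection
     * opened). Returns false once the model would close the connection
     * with a GOAWAY.
     */
    boolean onPing(long nowMillis) {
        // On an idle connection without permission, only very rare pings are tolerated.
        long minInterval = (!hasActiveCalls && !permitKeepAliveWithoutCalls)
                ? IDLE_MIN_INTERVAL_MILLIS
                : permitKeepAliveTimeMillis;
        if (nowMillis - lastAcceptedPingMillis >= minInterval) {
            lastAcceptedPingMillis = nowMillis;
            return true;
        }
        strikes++; // ping arrived too early: count a violation
        return strikes <= MAX_STRIKES;
    }
}
```

In this model, a client pinging an idle connection every 300s against the current defaults accumulates violations until the connection is closed, whereas the same client with active calls, or any client under the proposed `10000` ms / `true` settings, is never penalized.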
## Proposed Changes
Add the following configurable parameters to `ProxyConfig`:
| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `grpcServerPermitKeepAliveTimeMillis` | `10000` (10s) | Minimum interval between client keepalive pings that the server permits |
| `grpcServerPermitKeepAliveWithoutCalls` | `true` | Whether to allow keepalive pings when there are no outstanding RPCs |
| `grpcServerKeepAliveTimeMillis` | `60000` (60s) | Time between server-side keepalive pings to detect dead clients |
| `grpcServerKeepAliveTimeoutMillis` | `10000` (10s) | Timeout for server keepalive ping ACK before closing connection |
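The new fields could look roughly like the sketch below. The field names, defaults, and getter names come from this proposal; the standalone class `ProxyKeepAliveConfigSketch` and its `validate()` check are hypothetical and only stand in for the real `ProxyConfig`.

```java
/**
 * Hypothetical excerpt illustrating the proposed settings. In the real change
 * these fields would live in ProxyConfig; this standalone class and the
 * validate() method are assumptions for illustration only.
 */
final class ProxyKeepAliveConfigSketch {
    private long grpcServerPermitKeepAliveTimeMillis = 10_000L;
    private boolean grpcServerPermitKeepAliveWithoutCalls = true;
    private long grpcServerKeepAliveTimeMillis = 60_000L;
    private long grpcServerKeepAliveTimeoutMillis = 10_000L;

    long getGrpcServerPermitKeepAliveTimeMillis() {
        return grpcServerPermitKeepAliveTimeMillis;
    }

    void setGrpcServerPermitKeepAliveTimeMillis(long millis) {
        this.grpcServerPermitKeepAliveTimeMillis = millis;
    }

    boolean isGrpcServerPermitKeepAliveWithoutCalls() {
        return grpcServerPermitKeepAliveWithoutCalls;
    }

    void setGrpcServerPermitKeepAliveWithoutCalls(boolean permit) {
        this.grpcServerPermitKeepAliveWithoutCalls = permit;
    }

    long getGrpcServerKeepAliveTimeMillis() {
        return grpcServerKeepAliveTimeMillis;
    }

    void setGrpcServerKeepAliveTimeMillis(long millis) {
        this.grpcServerKeepAliveTimeMillis = millis;
    }

    long getGrpcServerKeepAliveTimeoutMillis() {
        return grpcServerKeepAliveTimeoutMillis;
    }

    void setGrpcServerKeepAliveTimeoutMillis(long millis) {
        this.grpcServerKeepAliveTimeoutMillis = millis;
    }

    /** Assumed sanity check (not part of the proposal): all durations must be positive. */
    void validate() {
        if (grpcServerPermitKeepAliveTimeMillis <= 0
                || grpcServerKeepAliveTimeMillis <= 0
                || grpcServerKeepAliveTimeoutMillis <= 0) {
            throw new IllegalArgumentException("gRPC keepalive durations must be positive");
        }
    }
}
```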
Apply these in `GrpcServerBuilder`:
```java
serverBuilder
    .permitKeepAliveTime(config.getGrpcServerPermitKeepAliveTimeMillis(), TimeUnit.MILLISECONDS)
    .permitKeepAliveWithoutCalls(config.isGrpcServerPermitKeepAliveWithoutCalls())
    .keepAliveTime(config.getGrpcServerKeepAliveTimeMillis(), TimeUnit.MILLISECONDS)
    .keepAliveTimeout(config.getGrpcServerKeepAliveTimeoutMillis(), TimeUnit.MILLISECONDS);
```
### Impact
- **Overhead is minimal**: Each keepalive PING frame is only ~70 bytes
(including TCP/IP headers). At a 30s interval (120 pings per hour), this adds
~8KB/hour per connection.
- **Backward compatible**: Default values are more permissive than the
current implicit defaults, but won't break existing clients. Clients with
longer keepalive intervals (e.g., current 300s) will continue to work fine.
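- Once the server permits shorter keepalive intervals, client-side
improvements (reducing `keepAliveTime` from 300s to 30s, together with a
`keepAliveTimeout` of about 10s) can bring dead-connection detection time from
**5.5 minutes down to ~40 seconds**. With the current 30s `keepAliveTimeout`
left unchanged, the worst case would be about 60 seconds.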
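The figures above follow from simple arithmetic, sketched here for checking. The helper class `KeepAliveImpact` and its method names are hypothetical. The 70-byte frame size is the estimate quoted above, and the timeout values passed in are assumptions about client settings, not measured values.

```java
/** Hypothetical helper that reproduces the estimates quoted in this issue. */
final class KeepAliveImpact {
    /** Worst case: one full keepalive interval passes, then the ping ACK times out. */
    static long worstCaseDetectionSeconds(long keepAliveTimeSec, long keepAliveTimeoutSec) {
        return keepAliveTimeSec + keepAliveTimeoutSec;
    }

    /** Approximate one-direction ping traffic per connection per hour. */
    static long pingBytesPerHour(long intervalSec, long bytesPerPing) {
        long pingsPerHour = 3600 / intervalSec;
        return pingsPerHour * bytesPerPing;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseDetectionSeconds(300, 30)); // current client settings
        System.out.println(worstCaseDetectionSeconds(30, 10));  // assumed future client settings
        System.out.println(pingBytesPerHour(30, 70));           // bytes/hour at a 30s interval
    }
}
```

These numbers count only the client's own pings. ACK frames and server-initiated pings add a comparable amount in the opposite direction.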
### Related Code
- `GrpcServerBuilder`:
`proxy/src/main/java/org/apache/rocketmq/proxy/grpc/GrpcServerBuilder.java`
- `ProxyConfig`:
`proxy/src/main/java/org/apache/rocketmq/proxy/config/ProxyConfig.java`
- Client keepalive settings: `keepAliveTime=300s, keepAliveTimeout=30s,
keepAliveWithoutCalls=true` in rocketmq-clients Java SDK (`RpcClientImpl.java`)