lhotari opened a new pull request, #26013:
URL: https://github.com/apache/pulsar/pull/26013
Fixes #26006
### Motivation
`PublishRateLimiterOverconsumingTest.testOverconsumingTokensWithBrokerPublishRateLimiter`
started
failing frequently in CI on 2026-06-11. Investigation showed this is not a
test problem: the broker
publish rate limiter genuinely stopped pacing publishes. With a broker-level
limit of 500 msg/s, all
3000 test messages were accepted in ~2 seconds (observed per-second receive
windows such as
`[1416, 1584]` with no effective throttling).
The regression surfaced with the Netty 4.1.135 → 4.2.15 upgrade (part of
#25886, the BookKeeper
4.18.0 upgrade). An A/B comparison on the same machine confirms it: at the
parent commit of the
upgrade the test passes with windows `[1308, 492, 504, 486, 210]`; at the
upgrade commit it fails
with windows like `[1314, 1686]` on both attempts.
The root cause is a behavior change in
`io.netty.handler.flow.FlowControlHandler`, which the broker
pipeline places directly before `ServerCnx` so that disabling auto-read
stops the delivery of
already-decoded messages:
- Up to Netty 4.1.135 and 4.2.14, the handler's dequeue loop re-checked
`config.isAutoRead()`
before releasing each queued message
(`while (queue != null && (consumed < minConsume ||
config.isAutoRead()))`), so
`ServerCnxThrottleTracker` calling `setAutoRead(false)` from inside
`fireChannelRead` paused
delivery immediately — reactive request throttling operated at
single-message granularity.
Instrumentation at the pre-upgrade commit shows the token balance of the
publish rate limiter's
token bucket staying within `[-25, +14]` across 212 unthrottle passes,
with the configured rate
emerging from rapid resume/re-throttle micro-cycles.
- Netty 4.2.15 (netty/netty#16837, backport of netty/netty#15053) rewrote
the handler to decide
once up front: with auto-read enabled it dequeues the entire queue and
ignores auto-read changes
made mid-drain by downstream handlers. A throttled connection that resumes
now delivers its whole
queued backlog before the renewed `setAutoRead(false)` takes effect.
Instrumentation on Netty
4.2.15 shows unthrottle passes followed by token balances of `-666` and
`-1734` — all connections
dumped their backlog at once, overconsuming tokens by 3x and more. This
affects every
`ServerCnxThrottleTracker`-based limit (publish rate, max pending publish
requests, in-flight
publish memory), not just publish rate limiting.
- The same rewrite landed on the Netty 4.1 branch after the 4.1.135 release
(netty/netty#16912),
so the old behavior will not return in future Netty versions of either
line.
No change to `PublishRateLimiterImpl` is needed. Its unthrottling pass
resumes all queued
producers, but with per-message auto-read responsiveness restored, the worst
case is each resumed
producer consuming a single queued entry after the rate is already exceeded;
the slightly negative
token balance then makes `AsyncTokenBucket.calculateThrottlingDuration()`
delay the next round.
Verified by instrumentation with this fix in place: token balance stays
within `[-25, +13]` across
205 unthrottle passes and the unthrottling queue never exceeds the producer
count.
### Modifications
- Add `org.apache.pulsar.common.util.netty.PulsarFlowControlHandler`, a
verbatim copy of Netty
4.2.14.Final's `io.netty.handler.flow.FlowControlHandler` (behaviorally
identical to
4.1.135.Final; only the license header, package, class name and javadoc
differ), preserving the
per-message auto-read check that Pulsar's reactive throttling depends on.
- Use it in `PulsarChannelInitializer` instead of Netty's handler.
- Add `PulsarFlowControlHandlerTest` pinning the critical property:
`setAutoRead(false)` from a
downstream `channelRead` stops delivery of queued messages immediately
(this test fails against
Netty 4.2.15's `FlowControlHandler`).
### Verifying this change
- [x] Make sure that the change passes the CI checks.
This change is already covered by existing tests, such as
`PublishRateLimiterOverconsumingTest.testOverconsumingTokensWithBrokerPublishRateLimiter`.
Verified locally with `invocationCount = 10`: 10/10 passes with per-second
receive windows back to
the pre-upgrade precision (e.g. `[1290, 498, 498, 498, 216]` for a 500 msg/s
limit; the first
window contains the token bucket's by-design initial capacity burst and is
skipped by the test),
where previously the test failed nearly deterministically on Netty 4.2.15.
Also verified
`PublishRateLimiterTest`, `PublishRateLimiterDisableTest`,
`MessagePublishThrottlingTest`,
`TopicPublishThrottlingInitTest` and the new `PulsarFlowControlHandlerTest`
locally.
### Does this pull request potentially affect one of the following parts:
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]