lhotari opened a new pull request, #26013:
URL: https://github.com/apache/pulsar/pull/26013

   Fixes #26006
   
   ### Motivation
   
   
`PublishRateLimiterOverconsumingTest.testOverconsumingTokensWithBrokerPublishRateLimiter`
 started
   failing frequently in CI on 2026-06-11. Investigation showed this is not a 
test problem: the broker
   publish rate limiter genuinely stopped pacing publishes. With a broker-level 
limit of 500 msg/s, all
   3000 test messages were accepted in ~2 seconds (observed per-second receive 
windows such as
   `[1416, 1584]` with no effective throttling).
   
   The regression surfaced with the Netty 4.1.135 → 4.2.15 upgrade (part of 
#25886, the BookKeeper
   4.18.0 upgrade). An A/B comparison on the same machine confirms it: at the 
parent commit of the
   upgrade the test passes with windows `[1308, 492, 504, 486, 210]`; at the 
upgrade commit it fails
   with windows like `[1314, 1686]` on both attempts.
   
   The root cause is a behavior change in 
`io.netty.handler.flow.FlowControlHandler`, which the broker
   pipeline places directly before `ServerCnx` so that disabling auto-read 
stops the delivery of
   already-decoded messages:
   
   - Up to Netty 4.1.135 and 4.2.14, the handler's dequeue loop re-checked 
`config.isAutoRead()`
     before releasing each queued message
     (`while (queue != null && (consumed < minConsume || 
config.isAutoRead()))`), so
     `ServerCnxThrottleTracker` calling `setAutoRead(false)` from inside 
`fireChannelRead` paused
     delivery immediately — reactive request throttling operated at 
single-message granularity.
     Instrumentation at the pre-upgrade commit shows the token balance of the 
publish rate limiter's
     token bucket staying within `[-25, +14]` across 212 unthrottle passes, 
with the configured rate
     emerging from rapid resume/re-throttle micro-cycles.
   - Netty 4.2.15 (netty/netty#16837, backport of netty/netty#15053) rewrote 
the handler to decide
     once up front: with auto-read enabled it dequeues the entire queue and 
ignores auto-read changes
     made mid-drain by downstream handlers. A throttled connection that resumes 
now delivers its whole
     queued backlog before the renewed `setAutoRead(false)` takes effect. 
Instrumentation on Netty
     4.2.15 shows unthrottle passes followed by token balances of `-666` and 
`-1734` — all connections
     dumped their backlog at once, overconsuming tokens by 3x and more. This 
affects every
     `ServerCnxThrottleTracker`-based limit (publish rate, max pending publish 
requests, in-flight
     publish memory), not just publish rate limiting.
   - The same rewrite landed on the Netty 4.1 branch after the 4.1.135 release 
(netty/netty#16912),
     so the old behavior will not return in future Netty versions of either 
line.
   
   No change to `PublishRateLimiterImpl` is needed. Its unthrottling pass 
resumes all queued
   producers, but with per-message auto-read responsiveness restored, the worst 
case is each resumed
   producer consuming a single queued entry after the rate is already exceeded; 
the slightly negative
   token balance then makes `AsyncTokenBucket.calculateThrottlingDuration()` 
delay the next round.
   Verified by instrumentation with this fix in place: token balance stays 
within `[-25, +13]` across
   205 unthrottle passes and the unthrottling queue never exceeds the producer 
count.
   
   ### Modifications
   
   - Add `org.apache.pulsar.common.util.netty.PulsarFlowControlHandler`, a 
verbatim copy of Netty
     4.2.14.Final's `io.netty.handler.flow.FlowControlHandler` (behaviorally 
identical to
     4.1.135.Final; only the license header, package, class name and javadoc 
differ), preserving the
     per-message auto-read check that Pulsar's reactive throttling depends on.
   - Use it in `PulsarChannelInitializer` instead of Netty's handler.
   - Add `PulsarFlowControlHandlerTest` pinning the critical property: 
`setAutoRead(false)` from a
     downstream `channelRead` stops delivery of queued messages immediately 
(this test fails against
     Netty 4.2.15's `FlowControlHandler`).
   
   ### Verifying this change
   
   - [x] Make sure that the change passes the CI checks.
   
   This change is already covered by existing tests, such as
   
`PublishRateLimiterOverconsumingTest.testOverconsumingTokensWithBrokerPublishRateLimiter`.
   Verified locally with `invocationCount = 10`: 10/10 passes with per-second 
receive windows back to
   the pre-upgrade precision (e.g. `[1290, 498, 498, 498, 216]` for a 500 msg/s 
limit; the first
   window contains the token bucket's by-design initial capacity burst and is 
skipped by the test),
   where previously the test failed nearly deterministically on Netty 4.2.15. 
Also verified
   `PublishRateLimiterTest`, `PublishRateLimiterDisableTest`, 
`MessagePublishThrottlingTest`,
   `TopicPublishThrottlingInitTest` and the new `PulsarFlowControlHandlerTest` 
locally.
   
   ### Does this pull request potentially affect one of the following parts:
   
   - [ ] Dependencies (add or upgrade a dependency)
   - [ ] The public API
   - [ ] The schema
   - [ ] The default values of configurations
   - [ ] The threading model
   - [ ] The binary protocol
   - [ ] The REST endpoints
   - [ ] The admin CLI options
   - [ ] The metrics
   - [ ] Anything that affects deployment
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to