Hello all,

Currently, Pulsar CI lacks the capability to systematically detect and
report Netty buffer leaks. This creates a significant blind spot in
our quality assurance process, as these leaks can slip through
undetected until they manifest as performance degradation, resource
exhaustion, or unpredictable failures in production environments.

I've submitted a PR (#24272) that addresses this gap by adding Netty
leak detection and reporting to our CI pipeline. The rollout follows a
staged approach:

1. First, enable leak detection and reporting without failing CI builds
2. Then, fix the identified leaks in both test and production code
3. Eventually, enable strict enforcement, where CI builds fail
   when leaks are detected

The PR adds a custom ExtendedNettyLeakDetector implementation,
configures it to output detailed reports to a designated directory,
and adds reporting steps to all CI workflows to display leaks directly
in the GitHub Actions UI. It also enhances the PulsarTestListener to
trigger leak detection at key test lifecycle events and provides
capabilities to collect and report leaks from integration tests.

Most of the detected leaks are in test code, but some seem to be in
production code. I've used this solution for the last couple of months
to detect leaks in Pulsar code by running tests locally, and it works
well.
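
To illustrate how the report directory and the staged rollout fit
together, here is a minimal sketch. It is not the code in the PR. It
assumes that each detected leak ends up as a file in a report
directory; the class name LeakReportGate, the default directory
target/netty-leak-reports and the NETTY_LEAK_STRICT environment
variable are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical CI gate: lists leak report files and decides whether the step fails.
class LeakReportGate {

    // Returns the regular files in the report directory, sorted by path.
    // A missing directory is treated as "no leaks reported".
    static List<Path> findReports(Path reportDir) throws IOException {
        if (!Files.isDirectory(reportDir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(reportDir)) {
            return files.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Prints every report and returns the exit code for the CI step.
    // Report-only mode (strict == false) always returns 0, which matches stage 1.
    // Strict mode returns 1 when at least one report exists, which matches stage 3.
    static int evaluate(Path reportDir, boolean strict) throws IOException {
        List<Path> reports = findReports(reportDir);
        for (Path report : reports) {
            System.out.println("Netty leak report: " + report.getFileName());
        }
        return (strict && !reports.isEmpty()) ? 1 : 0;
    }

    public static void main(String[] args) throws IOException {
        // Both the default directory and the variable name are assumptions.
        Path dir = Path.of(args.length > 0 ? args[0] : "target/netty-leak-reports");
        boolean strict = Boolean.parseBoolean(System.getenv("NETTY_LEAK_STRICT"));
        int exitCode = evaluate(dir, strict);
        if (exitCode != 0) {
            System.exit(exitCode);
        }
    }
}
```

With a gate like this, moving from stage 1 to stage 3 would be a
one-line configuration change in the workflow rather than a code
change.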

By catching these issues early in the development cycle, we can create
an automated safety net that prevents future Netty buffer management
regressions and significantly improves system stability and resource
efficiency in production environments.

Please review the PR at https://github.com/apache/pulsar/pull/24272.

-Lari
