Hello all,

Currently, Pulsar CI has no systematic way to detect and report Netty buffer leaks. This is a significant blind spot in our quality assurance process: leaks can slip through undetected until they show up as performance degradation, resource exhaustion, or unpredictable failures in production environments.

I've submitted a PR (#24272) that addresses this gap by adding advanced Netty leak detection to our CI pipeline. The implementation follows a staged approach:

1. First, enable leak detection and reporting without failing CI builds
2. Fix the identified leaks in both test and production code
3. Eventually, enable strict enforcement, where CI builds fail when leaks are detected

The PR adds a custom ExtendedNettyLeakDetector implementation, configures it to write detailed reports to a designated directory, and adds reporting steps to all CI workflows so that leaks are displayed directly in the GitHub Actions UI. It also enhances PulsarTestListener to trigger leak detection at key test lifecycle events, and it adds support for collecting and reporting leaks from integration tests.

Most of the detected leaks are in test code, but some appear to be in production code. I've used this solution for the last couple of months to detect leaks in Pulsar code by running tests locally, and it works well.

Catching these issues early in the development cycle gives us an automated safety net against future Netty buffer management regressions, and should improve system stability and resource efficiency in production environments.

Please review the PR at https://github.com/apache/pulsar/pull/24272.

-Lari