andrijapanicsb opened a new pull request, #13407:
URL: https://github.com/apache/cloudstack/pull/13407

   ### Description
   
   This PR makes the HikariCP **leak-detection threshold** and **JMX MBean 
registration** configurable per database pool via `db.properties`, instead of 
relying on HikariCP defaults that cannot be changed without a code change.
   
   CloudStack already maps a subset of `db.properties` values onto 
`HikariConfig` in 
`framework/db/src/main/java/com/cloud/utils/db/TransactionLegacy.java` (e.g. 
`maxActive`, `maxIdle`, `maxWait`, `minIdleConnections`, `connectionTimeout`, 
`keepAliveTime`). This PR adds two more, following the exact same 
parsing/threading pattern:
   
   | Property | Maps to | Default |
   |---|---|---|
   | `db.<pool>.leakDetectionThreshold` | 
`HikariConfig#setLeakDetectionThreshold(long)` | `0` (disabled) |
   | `db.<pool>.registerMbeans` | `HikariConfig#setRegisterMbeans(boolean)` | 
`false` (disabled) |
   
   Supported for all three pools that use the shared datasource factory: 
`cloud`, `usage`, `simulator`.
   
   **Behaviour:**
   - `leakDetectionThreshold` absent or `0` → leak detection disabled 
(unchanged default behaviour). Only applied when set to a value `> 0`. 
*(HikariCP itself ignores values below 2000 ms with a warning.)*
   - `registerMbeans` absent or `false` → MBeans disabled (unchanged default). 
`true` → Hikari JMX MBeans registered for live pool-counter observation.
   - Settings are applied **only on the HikariCP path**; the DBCP datasource 
path is unaffected.
   - A debug-level log line reports the effective values per pool (no 
credentials or sensitive data).
   
   **Motivation / context:** in production we saw the management server become 
unstable — and eventually crash — on clusters exercising Host-HA. Watching 
MySQL with `SHOW PROCESSLIST` during the incident showed the number of sessions 
owned by the `cloud` DB user climbing steadily over a couple of hours, **all of 
them in the `Sleep` state**, until the HikariCP pool (`db.cloud.maxActive`, 
default `250`) was exhausted and the server could no longer borrow a 
connection. That signature — monotonically growing, never-reaped, all idle, all 
owned by the `cloud` user — is a classic DB connection leak in a periodic code 
path (suspected Host-HA host checks) that borrows a pooled connection and never 
returns it.
   
   The problem is these symptoms tell you *that* connections leak, not *where*. 
HikariCP already has the exact tool for that — `leakDetectionThreshold` — but 
CloudStack hard-wires it off with no way to turn it on. This PR exposes it (and 
`registerMbeans`) through `db.properties` so an operator can enable leak 
detection on a live server; HikariCP then logs an `Apparent connection leak 
detected` stack trace identifying the precise code path that borrowed the 
connection and failed to return it, and the MBeans give live pool-counter 
visibility. The actual leak fix is a separate change; this PR is the diagnostic 
enabler.
   
   Everything is **disabled by default**, so there is no behavioural change for 
existing deployments that don't set the new properties.
   
   <!-- Fixes: # -->
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [x] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   - [ ] Build/CI
   - [ ] Test (unit or integration test code)
   
   ### Feature/Enhancement Scale or Bug Severity
   
   #### Feature/Enhancement Scale
   
   - [ ] Major
   - [x] Minor
   
   #### Bug Severity
   
   - [ ] BLOCKER
   - [ ] Critical
   - [ ] Major
   - [ ] Minor
   - [ ] Trivial
   
   ### Screenshots (if appropriate):
   
   N/A
   
   ### How Has This Been Tested?
   
   **Build:** compiled the affected module and its dependencies off tag 
`4.22.1.0`:
   
   ```
   mvn -pl framework/db -am -DskipTests compile
   ```
   Result: **BUILD SUCCESS** (checkstyle passed). The change is confined to 
property parsing and threading through the existing `createDataSource` → 
`createHikaricpDataSource` chain, reusing the existing `parseNumber(...)` 
helper.
   
   **Unit tests:** the apply-logic is factored into a package-private 
`applyHikariDebugSettings(HikariConfig, Long, Boolean, String)` and covered by 
4 new `TransactionLegacyTest` cases — defaults-disabled, `0`-keeps-disabled, 
leak-detection-enabled (`60000`), and register-MBeans-enabled:
   
   ```
   mvn -pl framework/db test 
-Dtest='TransactionLegacyTest#applyHikariDebugSettings*'
   ```
   Result: **Tests run: 4, Failures: 0, Errors: 0** — BUILD SUCCESS.
   
   **Runtime validation plan (on a patched management server):**
   
   1. In `/etc/cloudstack/management/db.properties`:
      ```
      db.cloud.leakDetectionThreshold=60000
      db.cloud.registerMbeans=true
      ```
   2. `systemctl restart cloudstack-management`
   3. Reproduce the Host-HA scenario and watch:
      ```
      grep -iE "leak detection|Apparent connection leak|ProxyLeakTask|Hikari" \
        /var/log/cloudstack/management/management-server.log
      ```
      Expected: `java.lang.Exception: Apparent connection leak detected` with a 
stack trace through `com.zaxxer.hikari.HikariDataSource.getConnection(...)` → 
`com.cloud.utils.db.TransactionLegacy...` identifying the borrowing path.
   4. With `registerMbeans=true`, the `com.zaxxer.hikari:type=Pool (cloud)` 
MBean is visible via JMX for live pool counters.
   
   #### How did you try to break this feature and the system with this change?
   
   Edge cases considered:
   - Property **absent** → null → leak detection disabled, 
`registerMbeans=false` (existing behaviour preserved).
   - `leakDetectionThreshold=0` → not applied (disabled).
   - `leakDetectionThreshold` between 1–1999 ms → passed to Hikari, which warns 
and ignores it (documented Hikari behaviour; noted in the code comment and the 
sample config).
   - `registerMbeans=false` explicitly → MBeans off.
   - DBCP connection-pool path → new values are not forwarded, so behaviour is 
unchanged.
   - Default datasource fallback path (`getDefaultHikaricpDataSource`) → 
untouched.
   
   These cases (defaults, `0`, enabled threshold, enabled MBeans) are locked 
down by the new `applyHikariDebugSettings` unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to