wangzhigang1999 opened a new issue, #7050: URL: https://github.com/apache/kyuubi/issues/7050
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [x] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the feature

This feature aims to introduce dynamic resizing (scaling up and down) to Kyuubi's Engine Pool mechanism. It introduces four pieces: a pluggable `PoolScalingStrategy`, a coordinating manager `EnginePoolManager`, an accessor interface `EnginePoolAccessor` for interacting with specific pool implementations, and monitoring metrics `PoolMetrics`. With these, the size of an Engine Pool can adjust automatically based on predefined policies, such as time-based rules and, potentially in the future, load-based strategies.

### Motivation

**Current Situation:** Kyuubi's Engine Pool pre-starts and manages a set of engine instances for specific users or groups to reduce session creation latency. Currently, the size of each pool is statically configured via `kyuubi.engine.pool.size`, so administrators must set and adjust it manually based on expected load.

**Problems:**

1. **Resource Waste:** A statically large pool leaves many engines idle during low-load periods, consuming CPU, memory, and possibly licensing resources unnecessarily.
2. **Performance Bottlenecks and User Experience Degradation:** During peak traffic, a statically small pool may be unable to handle the concurrency, causing session requests to queue or even time out, which hurts throughput and user experience.
3. **Operational Complexity:** Manual, experience-based monitoring and resizing is inefficient, reactive, error-prone, and ill-suited to complex or rapidly changing load patterns.
4. **Inadequate Cloud-Native Adaptability:** Static sizing cannot take advantage of cloud elasticity, so resources cannot be allocated according to actual demand, which runs contrary to cloud-native principles.

**Real-World Scenario:** We run a big data cluster that submits jobs via Kyuubi and has day-night resource elasticity (e.g., 8–10 TB of memory resident during the day, expanded to roughly 1.5× that at night). Engine pools are used for certain heavy users:

- **Option 1 (2 engines, 48GB RAM each):** Under the nighttime peak, the few engines become bottlenecks and are prone to GC pauses. An engine often hangs on GC, delaying jobs by 1-2 hours and severely increasing the nighttime operational burden.
- **Option 2 (4 engines, 32GB RAM each):** This can barely handle the nighttime load, but the engines remain occupied during the daytime because of ongoing sessions and Spark Dynamic Allocation executor caching, wasting resources and affecting other non-Kyuubi jobs.

This illustrates the limitations of static pools when resources are dynamic and loads vary, and underscores the need for automated elastic scaling.

**Goals and Benefits:**

- **Improve Resource Efficiency:** Automatically shrink the pool during low load to free idle resources and cut costs.
- **Enhance System Resilience:** Proactively expand the pool under high load to respond promptly to user demand, ensuring service performance and availability.
- **Increase Adaptability:** Enable Kyuubi Engine Pools to adapt automatically to periodic or bursty workload fluctuations.
- **Simplify Operations:** Reduce manual intervention and management complexity through automated scaling.
- **Better Cloud-Native Support:** Leverage cloud platform elasticity for on-demand resource allocation.

### Describe the solution

We introduce a set of new components and configurations to enable dynamic resizing of Engine Pools. The core architecture revolves around instantiating an `EnginePoolManager` for each pool (or sub-pool) requiring dynamic scaling.
This manager runs periodically (as per `scaling.interval`), computes the target size through a configurable `PoolScalingStrategy`, and interacts with the concrete pool implementation via an `EnginePoolAccessor` to carry out scale-up or **graceful scale-down** operations. A cooldown period is enforced to stabilize scaling, and detailed metrics are exposed through `PoolMetrics`.

#### Core Components

1. **`EnginePoolManager`**
   - **Responsibilities:** Manages the dynamic scaling lifecycle for a single Engine Pool identified by `poolIdentifier` (e.g., user/group key). Runs periodic scaling checks using a scheduled executor. It retrieves the current pool size and optional metrics via `EnginePoolAccessor`, calculates the desired size via `PoolScalingStrategy`, respects the cooldown period (skipping scaling while the cooldown is active), and triggers resize operations as needed. It logs scaling events, target and actual sizes, latencies, and errors, and reports them to `PoolMetrics`.
   - **Lifecycle:** Tied to the Engine Pool instance in the Kyuubi server: created and started on server/pool startup, and gracefully stopped on shutdown or pool destruction.
2. **`PoolScalingStrategy`** (Pluggable Interface)
   - **Responsibilities:** Defines the core logic for computing the target pool size. Must be stateless, or serializable if needed for configuration distribution. Receives a `PoolContext` with the pool ID, current time, current size, min/max bounds, and optional load/performance metrics collected from the pool. Returns a desired target size, ideally within the min/max bounds (final bounds enforcement is done by `EnginePoolManager`).
   - Allows users to implement and plug in custom scaling algorithms.
3. **`EnginePoolAccessor`** (Interface to the Pool Implementation)
   - **Responsibilities:** Abstracts interaction with the concrete Engine Pool implementations (`EnginePool`, `UserGroupAwareEnginePool`, etc.).
     Provides:
     - Precise retrieval of the **current effective pool size** (excluding starting or pending-removal engines).
     - Execution of **scale-up** commands (for example, creating new engines asynchronously) and **graceful scale-down** commands.
     - Collection of internal metrics useful for scaling decisions (active sessions, pending sessions, idle engines, pending removal counts, etc.).
4. **`PoolMetrics`** Interface
   - **Responsibilities:** Defines APIs to report and monitor dynamic scaling activity, such as current and target pool sizes, scaling events (scale-ups and scale-downs), scaling latencies, and errors.
   - A default implementation will integrate with Kyuubi's existing `MetricsSystem` to register gauges, counters, timers, etc., with appropriate labels to distinguish pools.

```mermaid
sequenceDiagram
    title Dynamic Scaling Check Sequence
    participant S as Scheduler (in Manager)
    participant EPM as EnginePoolManager
    participant EPA as EnginePoolAccessor
    participant PSS as PoolScalingStrategy
    participant PM as PoolMetrics

    S ->>+ EPM: Trigger Scaling Check (Every Interval)
    EPM ->> EPM: Check Cooldown Period
    opt Cooldown Active
        EPM -->> S: Skip Check (In Cooldown)
    end
    EPM ->>+ EPA: getCurrentSize()
    EPA -->>- EPM: currentSize
    EPM ->> PM: recordPoolSize(currentSize)
    EPM ->>+ EPA: collectMetrics()
    EPA -->>- EPM: poolMetricsMap
    EPM ->> EPM: Create PoolContext(currentSize, poolMetricsMap, ...)
    EPM ->>+ PSS: calculateTargetSize(context)
    PSS -->>- EPM: targetSizeRaw
    EPM ->> EPM: Clamp targetSize = max(minSize, min(maxSize, targetSizeRaw))
    EPM ->> PM: recordTargetPoolSize(targetSize)
    alt targetSize != currentSize
        EPM ->>+ EPA: resize(targetSize)
        Note right of EPA: Initiates async scale-up or<br/>graceful scale-down
        EPA -->>- EPM: Resize Requested (returns)
        EPM ->> EPM: Update lastScalingTimestamp
        EPM ->> PM: recordScalingEvent(currentSize, targetSize)
    else targetSize == currentSize
        EPM ->> EPM: Log "No scaling needed"
    end
    EPM ->> PM: recordScalingLatency(...)
    EPM -->>- S: Check Complete
```

### Additional context

This is an initial proposal for adding dynamic scaling to the Engine Pool in Kyuubi. The design and implementation details are still open for discussion. I sincerely welcome feedback, suggestions, and improvements from the community to help refine this feature and make it more robust and better aligned with real-world needs. Looking forward to collaborating with everyone!

### Are you willing to submit PR?

- [x] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [ ] No. I cannot submit a PR at this time.
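
To make the scaling check described under "Core Components" more concrete, here is a minimal Java sketch of one pass through the sequence: cooldown check, strategy call, clamping to the min/max bounds, and resize. The method names `getCurrentSize`, `collectMetrics`, `resize`, and `calculateTargetSize` come from the sequence diagram. Everything else is an assumption made for illustration: `DayNightStrategy`, `FakePool`, `runScalingCheck`, the `PoolContext` field names, and the sizes and times. `PoolMetrics` reporting is left out for brevity, and the current time is passed in rather than driven by a scheduled executor, so the behavior is easy to follow. This is not the proposed implementation.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Collections;
import java.util.Map;

public class PoolScalingSketch {

    /** Snapshot handed to the strategy; field names are illustrative. */
    static final class PoolContext {
        final String poolId;
        final LocalDateTime now;
        final int currentSize, minSize, maxSize;
        final Map<String, Long> metrics;

        PoolContext(String poolId, LocalDateTime now, int currentSize,
                    int minSize, int maxSize, Map<String, Long> metrics) {
            this.poolId = poolId; this.now = now; this.currentSize = currentSize;
            this.minSize = minSize; this.maxSize = maxSize; this.metrics = metrics;
        }
    }

    interface PoolScalingStrategy {
        int calculateTargetSize(PoolContext ctx);
    }

    interface EnginePoolAccessor {
        int getCurrentSize();
        Map<String, Long> collectMetrics();
        void resize(int targetSize); // async scale-up or graceful scale-down
    }

    /** Hypothetical time-based rule; assumes the night window wraps past midnight. */
    static final class DayNightStrategy implements PoolScalingStrategy {
        private final LocalTime nightStart, nightEnd;
        private final int daySize, nightSize;

        DayNightStrategy(LocalTime nightStart, LocalTime nightEnd, int daySize, int nightSize) {
            this.nightStart = nightStart; this.nightEnd = nightEnd;
            this.daySize = daySize; this.nightSize = nightSize;
        }

        @Override
        public int calculateTargetSize(PoolContext ctx) {
            LocalTime t = ctx.now.toLocalTime();
            boolean night = !t.isBefore(nightStart) || t.isBefore(nightEnd);
            return night ? nightSize : daySize;
        }
    }

    static final class EnginePoolManager {
        private final String poolId;
        private final int minSize, maxSize;
        private final Duration cooldown;
        private final PoolScalingStrategy strategy;
        private final EnginePoolAccessor accessor;
        private LocalDateTime lastScaling; // null until the first resize

        EnginePoolManager(String poolId, int minSize, int maxSize, Duration cooldown,
                          PoolScalingStrategy strategy, EnginePoolAccessor accessor) {
            this.poolId = poolId; this.minSize = minSize; this.maxSize = maxSize;
            this.cooldown = cooldown; this.strategy = strategy; this.accessor = accessor;
        }

        /** Runs one scaling check; returns true if a resize was requested. */
        boolean runScalingCheck(LocalDateTime now) {
            if (lastScaling != null && Duration.between(lastScaling, now).compareTo(cooldown) < 0) {
                return false; // still cooling down: skip this check
            }
            int current = accessor.getCurrentSize();
            PoolContext ctx = new PoolContext(poolId, now, current, minSize, maxSize,
                    accessor.collectMetrics());
            int raw = strategy.calculateTargetSize(ctx);
            int target = Math.max(minSize, Math.min(maxSize, raw)); // manager enforces bounds
            if (target == current) {
                return false; // no scaling needed
            }
            accessor.resize(target);
            lastScaling = now;
            return true;
        }
    }

    /** In-memory stand-in for a real pool; resizes instantly. */
    static final class FakePool implements EnginePoolAccessor {
        int size;
        FakePool(int size) { this.size = size; }
        @Override public int getCurrentSize() { return size; }
        @Override public Map<String, Long> collectMetrics() { return Collections.emptyMap(); }
        @Override public void resize(int targetSize) { size = targetSize; }
    }

    public static void main(String[] args) {
        FakePool pool = new FakePool(2);
        EnginePoolManager manager = new EnginePoolManager("user-a", 2, 4, Duration.ofMinutes(10),
                new DayNightStrategy(LocalTime.of(20, 0), LocalTime.of(6, 0), 2, 8), pool);
        LocalDateTime night = LocalDateTime.of(2025, 1, 1, 22, 0);
        System.out.println(manager.runScalingCheck(night) + " " + pool.size);                // true 4
        System.out.println(manager.runScalingCheck(night.plusMinutes(5)) + " " + pool.size); // false 4
        System.out.println(manager.runScalingCheck(night.plusHours(11)) + " " + pool.size);  // true 2
    }
}
```

In this sketch the strategy asks for 8 engines at night, but the manager clamps the request to its own maximum of 4, mirroring the note that final bounds enforcement belongs to `EnginePoolManager`. The second call is skipped because it falls inside the 10-minute cooldown.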