[I] [FEATURE] Dynamic Engine Pool Scaling [kyuubi]

via GitHub Mon, 28 Apr 2025 08:50:40 -0700


wangzhigang1999 opened a new issue, #7050:
URL: https://github.com/apache/kyuubi/issues/7050


   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [x] I have searched in the 
[issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Describe the feature
   
   This feature aims to introduce dynamic resizing (scaling up and down) 
capabilities to Kyuubi's Engine Pool mechanism. It introduces a pluggable 
`PoolScalingStrategy`, a coordinating `EnginePoolManager`, an 
`EnginePoolAccessor` interface for interacting with specific pool 
implementations, and `PoolMetrics` for monitoring. Together these let the size 
of the Engine Pool adjust automatically based on predefined policies, such as 
time-based rules and, potentially in the future, load-based strategies.
   
   ### Motivation
   
   **Current Situation:**
   
   Kyuubi’s Engine Pool pre-starts and manages a set of engine instances for 
specific users or groups to reduce session creation latency. Currently, the 
size of each pool is statically configured via `kyuubi.engine.pool.size`, 
requiring administrators to manually set and adjust this size based on expected 
loads.
   
   **Problems:**
   
   1. **Resource Waste:** A statically large pool during low-load periods 
leaves many engines idle, consuming CPU, memory, and possibly licensing 
resources unnecessarily.
   2. **Performance Bottlenecks and User Experience Degradation:** During peak 
traffic, a statically small pool may be insufficient to handle concurrency, 
causing session requests to queue or even time out, hurting throughput and user 
experience.
   3. **Operational Complexity:** Manual monitoring and resizing based on 
experience is inefficient, reactive, error-prone, and ill-suited for complex or 
rapidly changing load patterns.
   4. **Inadequate Cloud-Native Adaptability:** Static sizing cannot leverage 
cloud elasticity effectively, limiting dynamic resource allocation per actual 
demand—contrary to cloud-native principles.
   
   **Real-World Scenario:**
   We run a big data cluster that submits jobs through Kyuubi and has day-night 
resource elasticity (e.g., 8–10 TB of memory resident during the day, expanding 
to roughly 1.5× that at night). Engine pools are used for certain heavy users, 
with two static sizing options:
   
   - **Option 1 (2 engines, 48GB RAM each):** Under the nighttime peak, the two 
engines become bottlenecks and are prone to GC pauses. An engine often hangs in 
GC, delaying jobs by 1-2 hours and significantly increasing the nighttime 
operational burden.
   - **Option 2 (4 engines, 32GB RAM each):** Barely handles the nighttime load, 
but the engines stay occupied during the day because of ongoing sessions and 
Spark Dynamic Allocation executor caching, which wastes resources and affects 
non-Kyuubi jobs.
   
   Neither option works well: static pools cannot cope when available resources 
and load both vary, which is why automated elastic scaling is needed.
   
   **Goals and Benefits:**
   
   - **Improve Resource Efficiency:** Automatically shrink pool size during low 
load to free idle resources and cut costs.
   - **Enhance System Resilience:** Expand pool size proactively under high 
load to respond promptly to user demand and maintain service performance and 
availability.
   - **Increase Adaptability:** Enable Kyuubi Engine Pools to automatically 
adapt to periodic or bursty workload fluctuations.
   - **Simplify Operations:** Reduce manual intervention and management 
complexity with automated scaling.
   - **Better Cloud-Native Support:** Leverage cloud platform elasticity for 
on-demand resource allocation.
   
   ### Describe the solution
   
   We introduce a set of new components and configurations to enable dynamic 
resizing of Engine Pools. The core architecture revolves around instantiating 
an `EnginePoolManager` for each pool (or sub-pool) requiring dynamic scaling. 
This manager periodically runs (as per `scaling.interval`), computes the target 
size through a configurable `PoolScalingStrategy`, and interacts with the 
concrete pool implementation via an `EnginePoolAccessor` to carry out scale-up 
or **graceful scale-down** operations. A cooldown period is enforced to 
stabilize scaling, and detailed metrics are exposed through `PoolMetrics`.
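
   The proposal does not yet define what "graceful" scale-down means in detail. The following is a minimal sketch of one plausible reading, written in Python for brevity rather than in Kyuubi's own Scala. `Engine`, `EngineState`, `InMemoryPool`, and all method names are hypothetical and not existing Kyuubi APIs; the rule of marking the least-loaded engines first is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class EngineState(Enum):
    STARTING = "starting"
    RUNNING = "running"
    PENDING_REMOVAL = "pending_removal"


@dataclass
class Engine:
    name: str
    state: EngineState = EngineState.RUNNING
    active_sessions: int = 0


class InMemoryPool:
    """Hypothetical stand-in for a real engine pool (not a Kyuubi class)."""

    def __init__(self, engines):
        self.engines = list(engines)

    def effective_size(self):
        # Only RUNNING engines count; STARTING and PENDING_REMOVAL are excluded.
        return sum(1 for e in self.engines if e.state is EngineState.RUNNING)

    def scale_down_to(self, target):
        # Assumption: mark the least-loaded RUNNING engines for removal
        # instead of terminating them immediately.
        excess = max(self.effective_size() - target, 0)
        running = [e for e in self.engines if e.state is EngineState.RUNNING]
        for engine in sorted(running, key=lambda e: e.active_sessions)[:excess]:
            engine.state = EngineState.PENDING_REMOVAL

    def reap(self):
        # Remove only marked engines that have drained; busy ones stay
        # until their sessions finish. Returns the names that were removed.
        removed = [
            e.name
            for e in self.engines
            if e.state is EngineState.PENDING_REMOVAL and e.active_sessions == 0
        ]
        self.engines = [e for e in self.engines if e.name not in removed]
        return removed
```

   In this reading, an engine marked for removal stops counting toward the effective pool size right away but is only terminated once its sessions have ended, so running queries are not interrupted.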
   
   #### Core Components
   
   1. **`EnginePoolManager`**
      - **Responsibilities:** Manages the dynamic scaling lifecycle for a 
single Engine Pool identified by `poolIdentifier` (e.g., user/group key). It 
runs periodic scaling checks on a scheduled executor. On each check it 
retrieves the current pool size and optional metrics via `EnginePoolAccessor`, 
calculates the desired size via `PoolScalingStrategy`, skips scaling while the 
cooldown period is active, and triggers a resize when needed. It reports 
scaling events, target and actual sizes, latencies, and errors to 
`PoolMetrics` and writes them to the logs.
      - **Lifecycle:** Tied to the Engine Pool instance in the Kyuubi server, 
created and started on server/pool startup, and gracefully stopped on shutdown 
or pool destruction.
   
   2. **`PoolScalingStrategy`** (Pluggable Interface)
      - **Responsibilities:** Defines the core logic for computing the target 
pool size. Implementations should be stateless, or serializable if they need to 
be distributed with configuration. The strategy receives a `PoolContext` with 
the pool ID, current time, current size, min/max bounds, and optional 
load/performance metrics collected from the pool. It returns a desired target 
size, ideally within the min/max bounds (final bounds enforcement is done by 
`EnginePoolManager`).
      - Allows users to implement and plug in custom scaling algorithms.
   
   3. **`EnginePoolAccessor`** (Interface to the Pool Implementation)
      - **Responsibilities:** Abstracts interaction with the concrete Engine 
Pool implementations (`EnginePool`, `UserGroupAwareEnginePool`, etc.). Provides:
        - Precise retrieval of the **current effective pool size** (excluding 
starting or pending-removal engines).
        - Execution of resize commands: **scale-up** (for example, creating new 
engines asynchronously) and **graceful scale-down**.
        - Collection of internal metrics useful to scaling decisions (active 
sessions, pending sessions, idle engines, pending removal counts, etc.).
   
   4. **`PoolMetrics`** Interface
      - **Responsibilities:** Defines APIs to report and monitor dynamic 
scaling activities such as current and target pool sizes, scaling events 
(scale-ups and downs), scaling latencies, and errors.
      - A default implementation will integrate with Kyuubi’s existing 
`MetricsSystem` to register gauges, counters, timers, etc., with appropriate 
labels to distinguish pools.
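
   To make the strategy contract above more concrete, here is a minimal sketch of a `PoolContext` and a time-based `PoolScalingStrategy`, written in Python for brevity rather than in Kyuubi's own Scala. The field names, the `calculate_target_size` method name, and `TimeWindowStrategy` with its parameters are assumptions for illustration only; the proposal does not fix these signatures.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Mapping, Protocol


@dataclass(frozen=True)
class PoolContext:
    # Fields follow the description above; exact names are assumptions.
    pool_id: str
    now: datetime
    current_size: int
    min_size: int
    max_size: int
    metrics: Mapping[str, float] = field(default_factory=dict)


class PoolScalingStrategy(Protocol):
    def calculate_target_size(self, context: PoolContext) -> int:
        """Return the desired pool size; the manager clamps it to the bounds."""
        ...


@dataclass(frozen=True)
class TimeWindowStrategy:
    """Hypothetical stateless time-based rule: a larger pool at night."""

    night_start_hour: int
    night_end_hour: int
    day_size: int
    night_size: int

    def calculate_target_size(self, context: PoolContext) -> int:
        hour = context.now.hour
        if self.night_start_hour <= self.night_end_hour:
            at_night = self.night_start_hour <= hour < self.night_end_hour
        else:
            # The window wraps past midnight, e.g. 22:00 to 06:00.
            at_night = hour >= self.night_start_hour or hour < self.night_end_hour
        return self.night_size if at_night else self.day_size
```

   Because the strategy holds only immutable configuration and reads everything else from the context, it stays stateless, which keeps it easy to test and to swap for a load-based rule later.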
   
   
   ```mermaid
   sequenceDiagram
       title Dynamic Scaling Check Sequence
   
       participant S as Scheduler (in Manager)
       participant EPM as EnginePoolManager
       participant EPA as EnginePoolAccessor
       participant PSS as PoolScalingStrategy
       participant PM as PoolMetrics
   
       S ->>+ EPM: Trigger Scaling Check (Every Interval)
       EPM ->> EPM: Check Cooldown Period
       opt Cooldown Active
           EPM -->> S: Skip Check (In Cooldown)
       end
       EPM ->>+ EPA: getCurrentSize()
       EPA -->>- EPM: currentSize
       EPM ->> PM: recordPoolSize(currentSize)
       EPM ->>+ EPA: collectMetrics()
       EPA -->>- EPM: poolMetricsMap
       EPM ->> EPM: Create PoolContext(currentSize, poolMetricsMap, ...)
       EPM ->>+ PSS: calculateTargetSize(context)
       PSS -->>- EPM: targetSizeRaw
       EPM ->> EPM: Clamp targetSize = max(minSize, min(maxSize, targetSizeRaw))
       EPM ->> PM: recordTargetPoolSize(targetSize)
   
       alt targetSize != currentSize
           EPM ->>+ EPA: resize(targetSize)
           Note right of EPA: Initiates async scale-up or<br/>graceful scale-down
           EPA -->>- EPM: Resize Requested (returns)
           EPM ->> EPM: Update lastScalingTimestamp
           EPM ->> PM: recordScalingEvent(currentSize, targetSize)
       else targetSize == currentSize
           EPM ->> EPM: Log "No scaling needed"
       end
   
       EPM ->> PM: recordScalingLatency(...)
       EPM -->>- S: Check Complete
   
   ```
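
   The sequence above can be sketched as a single check method. This is a minimal Python sketch under stated assumptions, not the proposed implementation: `FakeAccessor`, the `check` method, its string return values, and the plain-callable strategy are all hypothetical, and metrics reporting is reduced to an in-memory list. The sketch returns early during cooldown, following the prose description of the manager.

```python
class FakeAccessor:
    """Hypothetical in-memory accessor; resize takes effect immediately."""

    def __init__(self, size):
        self.size = size

    def get_current_size(self):
        return self.size

    def collect_metrics(self):
        return {}

    def resize(self, target):
        self.size = target


class EnginePoolManager:
    def __init__(self, accessor, strategy, min_size, max_size, cooldown_seconds):
        self.accessor = accessor
        self.strategy = strategy  # callable: (current_size, metrics, now) -> int
        self.min_size = min_size
        self.max_size = max_size
        self.cooldown_seconds = cooldown_seconds
        self.last_scaling_ts = None
        self.events = []  # stands in for PoolMetrics scaling events

    def check(self, now):
        # Skip the whole check while the cooldown period is still active.
        if (self.last_scaling_ts is not None
                and now - self.last_scaling_ts < self.cooldown_seconds):
            return "cooldown"

        current = self.accessor.get_current_size()
        metrics = self.accessor.collect_metrics()
        raw_target = self.strategy(current, metrics, now)

        # The manager, not the strategy, enforces the min/max bounds.
        target = max(self.min_size, min(self.max_size, raw_target))
        if target == current:
            return "noop"

        self.accessor.resize(target)
        self.last_scaling_ts = now
        self.events.append((current, target))
        return "resized"
```

   Passing `now` in as an argument, rather than reading the clock inside `check`, is a design choice of the sketch that makes cooldown behavior deterministic in tests.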
   
   ### Additional context
   
   This is an initial proposal aiming to address the dynamic scaling 
capabilities of the Engine Pool in Kyuubi. The design and implementation 
details are still open for discussion. I sincerely welcome feedback, 
suggestions, and any improvements from the community to help refine and make 
this feature more robust and aligned with real-world needs. Looking forward to 
collaborating with everyone!
   
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes. I would be willing to submit a PR with guidance from the Kyuubi 
community to improve.
   - [ ] No. I cannot submit a PR at this time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@kyuubi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscr...@kyuubi.apache.org
For additional commands, e-mail: notifications-h...@kyuubi.apache.org
