[I] EngineConn distribution imbalance across ECMs during creation [linkis]

via GitHub Sat, 20 Dec 2025 06:53:34 -0800


kinghao007 opened a new issue, #5322:
URL: https://github.com/apache/linkis/issues/5322


   ## Summary
   
   When creating EngineConn (EC) instances, the distribution may become uneven 
across EngineConnManager (ECM) nodes. ECs tend to be created on the same or a 
small subset of ECMs, even when other ECMs have abundant available resources. 
This leads to high load on specific ECM nodes while others remain idle.
   
   ## Problem Description
   
   In production environments with multiple ECM nodes, users have observed that:
   
   1. **Uneven EC distribution**: New EC instances are frequently created on 
the same ECM or a small number of ECMs
   2. **Resource underutilization**: Some ECMs remain idle while others are 
overloaded
   3. **Single point of high load**: One ECM may become a bottleneck, affecting 
overall system performance
   4. **Inefficient cluster utilization**: The full cluster capacity is not 
being leveraged
   
   ## Current Implementation Analysis
   
   The current ECM selection mechanism in 
`DefaultEngineCreateService.selectECM()` uses a multi-rule chain approach:
   
   | Rule | Priority | Function |
   |------|----------|----------|
   | ScoreNodeSelectRule | 0 | Sort by label matching score |
   | AvailableNodeSelectRule | 2 | Filter unavailable nodes |
   | OverLoadNodeSelectRule | 3 | Sort by memory usage rate |
   | ResourceNodeSelectRule | 5 | Sort by remaining resources |
   | NewECMStandbyRule | 7 | New ECM cooldown period |
   | HotspotExclusionRule | MaxValue | Random shuffle top nodes |
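   
   As a rough mental model of the chain, the sketch below applies a few rules in ascending priority order and takes the head of the final list. It assumes that is how the rules compose; the `Ecm` and `Rule` types and the sample rules are hypothetical stand-ins, not the Linkis classes. Without a final shuffle step, the same input always yields the same node.
   
   ```scala
   object RuleChainSketch {
     // Hypothetical stand-ins for illustration only, not the Linkis classes.
     final case class Ecm(name: String, labelScore: Int, available: Boolean, freeMemoryGb: Double)
     final case class Rule(name: String, priority: Int, run: Seq[Ecm] => Seq[Ecm])
   
     val rules: Seq[Rule] = Seq(
       Rule("resource", 5, _.sortBy(e => -e.freeMemoryGb)), // most free memory first
       Rule("score", 0, _.sortBy(e => -e.labelScore)),      // best label match first
       Rule("available", 2, _.filter(_.available))          // drop unavailable nodes
     )
   
     // Assumption: rules run in ascending priority order and the first node wins.
     // sortBy is stable, so later rules dominate and earlier ones only break ties.
     def select(nodes: Seq[Ecm]): Option[Ecm] =
       rules.sortBy(_.priority).foldLeft(nodes)((acc, rule) => rule.run(acc)).headOption
   }
   ```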
   
   **Key Files**:
   - `linkis-application-manager/.../service/engine/DefaultEngineCreateService.scala`
   - `linkis-application-manager/.../selector/DefaultNodeSelector.scala`
   - `linkis-application-manager/.../selector/rule/HotspotExclusionRule.scala`
   
   ## Root Cause Analysis
   
   Despite the existing HotspotExclusionRule, the following issues may cause 
distribution imbalance:
   
   1. **Delayed Resource Reporting**
      - Health report period is 10 seconds by default 
(`wds.linkis.ecm.health.report.period`)
      - During high concurrency, multiple requests may see stale resource data
      - All concurrent requests may choose the same "best" ECM
   
   2. **Limited Random Scope**
      - HotspotExclusionRule shuffles only the top 5 nodes
      - If fewer than 5 ECMs are available, randomization is ineffective
      - Label filtering may significantly reduce the available ECM pool
   
   3. **Deterministic Scoring**
      - Label matching scores may consistently favor certain ECMs
      - Resource sorting is deterministic, leading to repeated selection of the 
same node
   
   4. **Concurrent Race Condition**
      - Multiple EC creation requests arrive simultaneously
      - All of them see the same snapshot of cluster state
      - All of them select the same ECM before resource updates propagate
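   
   The combined effect of stale reports and deterministic sorting can be illustrated with a toy model. The sketch below is not Linkis code: the `Snapshot` type and the ECM names are made up, and it assumes every request between two health reports decides against the same data.
   
   ```scala
   object StaleSnapshotSketch {
     // Hypothetical model: ECM name -> free memory (GB) as of the last health report.
     type Snapshot = Map[String, Double]
   
     // Deterministic choice: always the ECM with the most reported free memory.
     def pickBest(snapshot: Snapshot): String = snapshot.maxBy(_._2)._1
   
     // Every request arriving between two reports decides against the same snapshot,
     // so the count of placements per ECM collapses onto a single node.
     def placeAll(snapshot: Snapshot, requests: Int): Map[String, Int] =
       (1 to requests)
         .map(_ => pickBest(snapshot))
         .groupBy(identity)
         .map { case (ecm, picks) => ecm -> picks.size }
   }
   ```
   
   With a snapshot such as `Map("ecm-1" -> 64.0, "ecm-2" -> 60.0, "ecm-3" -> 58.0)`, all requests in the window land on `ecm-1`, even though the other two nodes have nearly as much free memory.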
   
   ## Proposed Solutions
   
   ### Option 1: Enhanced Weighted Random Selection
   
   Implement weighted random selection based on available resources:
   
   ```scala
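   import scala.util.Random
   
   // Sketch. `weightOf` is a placeholder for a non-negative score derived from an
   // ECM's remaining resources; the original `node.leftResource.weight` was pseudocode.
   def selectECMWeighted(ecmNodes: Seq[EMNode], weightOf: EMNode => Double): Option[EMNode] = {
     val weights = ecmNodes.map(node => math.max(weightOf(node), 0.0))
     val totalWeight = weights.sum
     if (ecmNodes.isEmpty) None
     else if (totalWeight <= 0.0) {
       // No usable weights: fall back to a uniform random choice.
       Some(ecmNodes(Random.nextInt(ecmNodes.size)))
     } else {
       val target = Random.nextDouble() * totalWeight
       // Pick the first node whose cumulative weight exceeds the target.
       val cumulative = weights.scanLeft(0.0)(_ + _).tail
       ecmNodes.zip(cumulative)
         .collectFirst { case (node, sum) if sum > target => node }
         .orElse(ecmNodes.lastOption) // guards against floating-point rounding
     }
   }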
   ```
   
   ### Option 2: Distributed Counter with Lease
   
   Implement a distributed counter mechanism to track pending EC creations:
   
   - Before selecting ECM, acquire a "reservation" in distributed cache
   - Include pending reservations in resource calculation
   - Release reservation after EC creation completes or fails
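   
   A minimal sketch of the reservation idea follows. It uses an in-process map as a stand-in for the distributed cache, so it only illustrates the bookkeeping, not the distribution or the lease expiry. All names (`ReservationSketch`, `reserve`, `release`, `effectiveFree`) are hypothetical.
   
   ```scala
   import java.util.concurrent.ConcurrentHashMap
   import java.util.concurrent.atomic.AtomicInteger
   
   object ReservationSketch {
     // Stand-in for the distributed cache: pending EC creations per ECM.
     private val pending = new ConcurrentHashMap[String, AtomicInteger]()
   
     private def counter(ecm: String): AtomicInteger =
       pending.computeIfAbsent(ecm, _ => new AtomicInteger(0))
   
     def pendingOn(ecm: String): Int = counter(ecm).get()
   
     // Reported free slots minus reservations not yet visible in the health report.
     def effectiveFree(ecm: String, reportedFree: Int): Int = reportedFree - pendingOn(ecm)
   
     // Pick the ECM with the most effective capacity and reserve one slot on it.
     // Selection and increment happen under one lock so concurrent callers see each other.
     def reserve(reportedFree: Map[String, Int]): Option[String] = synchronized {
       if (reportedFree.isEmpty) None
       else {
         val (ecm, _) = reportedFree.maxBy { case (name, free) => effectiveFree(name, free) }
         counter(ecm).incrementAndGet()
         Some(ecm)
       }
     }
   
     // Call exactly once per successful reserve, when EC creation completes or fails.
     def release(ecm: String): Unit = {
       counter(ecm).decrementAndGet()
       ()
     }
   }
   ```
   
   Because each reservation lowers the effective capacity immediately, requests that share the same stale report still spread out instead of piling onto one node.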
   
   ### Option 3: Real-time Resource Synchronization
   
   Reduce resource reporting delay and implement push-based updates:
   
   - Decrease `wds.linkis.ecm.health.report.period` to 1-2 seconds
   - Implement event-driven resource updates when EC is created/destroyed
   - Use optimistic locking for concurrent selection
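   
   One way to read "optimistic locking for concurrent selection" is a compare-and-set loop over a versioned view of an ECM's capacity. The sketch below assumes that reading; `EcmState` and `claimSlot` are hypothetical, and a real implementation would keep this state in the manager's resource store rather than in a local `AtomicReference`.
   
   ```scala
   import java.util.concurrent.atomic.AtomicReference
   import scala.annotation.tailrec
   
   object OptimisticSelectSketch {
     // Hypothetical view of one ECM: a version bumped on every update, plus free slots.
     final case class EcmState(version: Long, freeSlots: Int)
   
     // Try to claim one slot. If another request changed the state between the read
     // and the write, the compare-and-set fails and the claim is retried on fresh data.
     @tailrec
     def claimSlot(state: AtomicReference[EcmState]): Boolean = {
       val seen = state.get()
       if (seen.freeSlots <= 0) false
       else if (state.compareAndSet(seen, EcmState(seen.version + 1, seen.freeSlots - 1))) true
       else claimSlot(state)
     }
   }
   ```
   
   A request whose claim returns `false` would move on to the next candidate ECM instead of over-committing the first one.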
   
   ### Option 4: Adaptive Load Balancing
   
   Implement adaptive selection with feedback:
   
   - Track actual EC creation success/failure rates per ECM
   - Apply penalties to frequently selected ECMs
   - Implement exponential backoff for overloaded nodes
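   
   The penalty and backoff ideas could look roughly like the sketch below. The constants and function names are hypothetical, not Linkis configuration, and the penalty formula is just one simple choice among many.
   
   ```scala
   object AdaptiveSketch {
     val baseBackoffMs: Long = 1000L // hypothetical tuning value
     val maxBackoffMs: Long = 60000L // hypothetical tuning value
   
     // Exponential backoff: the exclusion window doubles with each consecutive
     // failure and is capped at maxBackoffMs.
     def backoffMs(consecutiveFailures: Int): Long =
       if (consecutiveFailures <= 0) 0L
       else math.min(maxBackoffMs, baseBackoffMs << math.min(consecutiveFailures - 1, 16))
   
     // Penalty for frequently selected ECMs: each recent selection shrinks the
     // resource-based weight, so a busy node becomes less likely to be picked again.
     def adjustedWeight(baseWeight: Double, recentSelections: Int): Double =
       baseWeight / (1.0 + math.max(recentSelections, 0))
   }
   ```
   
   The adjusted weight can feed a weighted random selection such as Option 1, while an ECM inside its backoff window is skipped entirely.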
   
   ## Acceptance Criteria
   
   1. In concurrent EC creation scenarios (e.g., 100 concurrent requests), EC 
distribution across ECMs should be relatively even (variance < 20% of mean)
   2. No single ECM should receive more than 150% of the average load
   3. All existing functionality must remain intact
   4. Performance overhead should be minimal (< 5% latency increase)
   5. Solution should be configurable and backward compatible
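   
   Criteria 1 and 2 can be checked mechanically against the per-ECM EC counts from a test run. The sketch below is one possible check. It assumes "variance < 20% of mean" is read as standard deviation divided by mean being below 0.2, which is an interpretation rather than something stated above; the function names are hypothetical.
   
   ```scala
   object DistributionCheckSketch {
     // counts: number of ECs placed on each ECM, including ECMs that received none.
     // Both helpers expect a non-empty input with a positive total.
     def maxLoadRatio(counts: Seq[Int]): Double = {
       val mean = counts.sum.toDouble / counts.size
       counts.max / mean
     }
   
     // Assumption: criterion 1 is interpreted as standard deviation / mean < 0.2.
     def coefficientOfVariation(counts: Seq[Int]): Double = {
       val mean = counts.sum.toDouble / counts.size
       val variance = counts.map(c => math.pow(c - mean, 2)).sum / counts.size
       math.sqrt(variance) / mean
     }
   
     // Criterion 2: no ECM above 150% of the average load.
     def meetsCriteria(counts: Seq[Int]): Boolean =
       counts.nonEmpty && counts.sum > 0 &&
         coefficientOfVariation(counts) < 0.2 && maxLoadRatio(counts) <= 1.5
   }
   ```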
   
   ## Test Plan
   
   1. **Unit Tests**
      - Test weighted random selection algorithm
      - Test resource calculation with pending reservations
      - Test edge cases (single ECM, all ECMs at capacity)
   
   2. **Integration Tests**
      - Simulate concurrent EC creation requests
      - Verify even distribution across ECM nodes
      - Test failover when selected ECM becomes unavailable
   
   3. **Performance Tests**
      - Measure latency impact of new selection logic
      - Verify scalability with increasing ECM count
      - Benchmark under high concurrency
   
   ## Related Configuration
   
   | Config Key | Default | Description |
   |------------|---------|-------------|
   | wds.linkis.ecm.health.report.period | 10s | ECM health report interval |
   | linkis.node.select.hotspot.exclusion.rule.enable | true | Enable hotspot exclusion |
   | wds.linkis.manager.am.em.new.wait.mills | 60000 | New ECM cooldown period |
   
   ## References
   
   - ECM Architecture: 
https://linkis.apache.org/docs/latest/architecture/computation-governance-services/engine/engine-conn-manager
   - EngineConn Overview: 
https://linkis.apache.org/docs/latest/architecture/computation-governance-services/engine/engine-conn


