Re: [PR] Add optional plugins to basic cost function in CostBasedAutoScaler (druid)

via GitHub Thu, 05 Feb 2026 06:23:23 -0800


kfaraz commented on code in PR #18976:
URL: https://github.com/apache/druid/pull/18976#discussion_r2769419823



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScalerConfig.java:
##########
@@ -62,12 +63,20 @@ public class CostBasedAutoScalerConfig implements AutoScalerConfig
   private final double idleWeight;
   private final double defaultProcessingRate;
   /**
-   * Represents the threshold value used to prevent the auto-scaler from scaling down tasks immediately,
-   * when the computed cost-based metrics fall below this barrier.
-   * A higher value implies a more conservative scaling down behavior, ensuring that tasks
-   * are not prematurely terminated in scenarios of potential workload spikes or insufficient cost savings.
+   * Enables or disables {@code OptimalTaskCountBoundariesPlugin} which allows
+   * considering only task counts within a certain PPT-based window around the current PPT.
    */
-  private final int scaleDownBarrier;
+  private final boolean useTaskCountBoundaries;
+  /**
+   * Per-partition lag threshold allowing to activate a burst scaleup to eliminate high lag.
+   */
+  private final int highLagThreshold;
+  /**
+   * Represents the minimum duration between successful scale actions.

Review Comment:
   Please move these javadocs to the getters instead. That way, it would be 
easier for callers to look up the javadocs.
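
A rough sketch of what this could look like, using only the two fields whose names are visible in the hunk above. The class name, constructor, and getter names (`isUseTaskCountBoundaries`, `getHighLagThreshold`) are assumptions for illustration and may not match the PR.

```java
// Hypothetical, trimmed-down stand-in for CostBasedAutoScalerConfig.
// Only the fields visible in the diff are shown; getter names are assumed.
final class CostBasedAutoScalerConfigSketch
{
  private final boolean useTaskCountBoundaries;
  private final int highLagThreshold;

  CostBasedAutoScalerConfigSketch(boolean useTaskCountBoundaries, int highLagThreshold)
  {
    this.useTaskCountBoundaries = useTaskCountBoundaries;
    this.highLagThreshold = highLagThreshold;
  }

  /**
   * Enables or disables {@code OptimalTaskCountBoundariesPlugin} which allows
   * considering only task counts within a certain PPT-based window around the current PPT.
   */
  boolean isUseTaskCountBoundaries()
  {
    return useTaskCountBoundaries;
  }

  /**
   * Per-partition lag threshold allowing to activate a burst scaleup to eliminate high lag.
   */
  int getHighLagThreshold()
  {
    return highLagThreshold;
  }
}
```

With the javadoc on the getters, the description shows up at the call site in most IDEs, and the private fields need no comments of their own.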



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java:
##########
@@ -231,12 +222,17 @@ int computeOptimalTaskCount(CostMetrics metrics)
     for (int taskCount : validTaskCounts) {
       CostResult costResult = costFunction.computeCost(metrics, taskCount, config);
       double cost = costResult.totalCost();
-      log.debug(
-          "Proposed task count: %d, Cost: %.4f (lag: %.4f, idle: %.4f)",
+      log.info(
+          "Proposed task count[%d] has total cost[%.4f] = lagCost[%.4f] + idleCost[%.4f]."
+          + " Stats: avgPartitionLag[%.1f], pollIdleRatio[%.1f], lagWeight[%.1f], idleWeight[%.1f]",

Review Comment:
   Please don't log this here. This line will be logged for every proposed task 
count.
   We should log the stats only once.
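
One possible shape, as a self-contained sketch rather than the actual Druid code: the stats that do not depend on the candidate task count are logged once before the loop, and only the per-candidate cost is logged inside it. `LogOnceSketch`, `CostFn`, and the two `List<String>` log sinks are hypothetical stand-ins for the real cost function and logger, and keeping the per-candidate line at debug level is an assumption.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; CostFn and the list-based "loggers" are stand-ins.
final class LogOnceSketch
{
  /** Stand-in for the cost function; returns {lagCost, idleCost}. */
  interface CostFn
  {
    double[] cost(int taskCount);
  }

  static int pickTaskCount(
      int[] validTaskCounts,
      double avgPartitionLag,
      double pollIdleRatio,
      CostFn costFn,
      List<String> infoLog,
      List<String> debugLog
  )
  {
    // These stats are the same for every candidate, so log them once.
    infoLog.add(String.format(
        Locale.ROOT,
        "Stats: avgPartitionLag[%.1f], pollIdleRatio[%.1f]",
        avgPartitionLag,
        pollIdleRatio
    ));

    int bestTaskCount = -1;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int taskCount : validTaskCounts) {
      double[] parts = costFn.cost(taskCount);
      double total = parts[0] + parts[1];
      // Per-candidate detail only (assumed to stay at debug level).
      debugLog.add(String.format(
          Locale.ROOT,
          "Proposed task count[%d] has total cost[%.4f] = lagCost[%.4f] + idleCost[%.4f].",
          taskCount, total, parts[0], parts[1]
      ));
      if (total < bestCost) {
        bestCost = total;
        bestTaskCount = taskCount;
      }
    }
    return bestTaskCount;
  }
}
```

For N candidate task counts this produces one stats line plus N cost lines, instead of repeating the stats N times.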



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java:
##########
@@ -288,22 +285,33 @@ static int[] computeValidTaskCounts(
 
     IntSet result = new IntArraySet();
     final int currentPartitionsPerTask = partitionCount / currentTaskCount;
-    final int extraIncrease = computeExtraMaxPartitionsPerTaskIncrease(
-        aggregateLag,
-        partitionCount,
-        currentTaskCount,
-        taskCountMax
-    );
-    final int effectiveMaxIncrease = MAX_INCREASE_IN_PARTITIONS_PER_TASK + extraIncrease;
 
     // Minimum partitions per task correspond to the maximum number of tasks (scale up) and vice versa.
-    final int minPartitionsPerTask = Math.max(1, currentPartitionsPerTask - effectiveMaxIncrease);
-    final int maxPartitionsPerTask = Math.min(
-        partitionCount,
-        currentPartitionsPerTask + MAX_DECREASE_IN_PARTITIONS_PER_TASK
-    );
+    int minPartitionsPerTask = partitionCount / taskCountMax;
+    int maxPartitionsPerTask = partitionCount / taskCountMin;

Review Comment:
   Should we clamp these to the range [1, partitionCount]? Otherwise, they 
may fall outside those bounds.
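
A minimal sketch of the clamping, assuming `taskCountMin` and `taskCountMax` are both at least 1 and `partitionCount` is at least 1. The class and helper names are hypothetical.

```java
// Hypothetical helper, not the PR's code. Assumes taskCountMin >= 1,
// taskCountMax >= 1 and partitionCount >= 1.
final class PartitionsPerTaskBounds
{
  static int clamp(int value, int min, int max)
  {
    return Math.max(min, Math.min(max, value));
  }

  /** Returns {minPartitionsPerTask, maxPartitionsPerTask}, each within [1, partitionCount]. */
  static int[] compute(int partitionCount, int taskCountMin, int taskCountMax)
  {
    // Integer division yields 0 when taskCountMax > partitionCount,
    // so the lower clamp to 1 matters in practice.
    int minPartitionsPerTask = clamp(partitionCount / taskCountMax, 1, partitionCount);
    int maxPartitionsPerTask = clamp(partitionCount / taskCountMin, 1, partitionCount);
    return new int[]{minPartitionsPerTask, maxPartitionsPerTask};
  }
}
```

For example, with `partitionCount = 10` and `taskCountMax = 20`, the unclamped lower bound is `10 / 20 = 0` partitions per task, which the clamp raises to 1.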



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java:
##########
@@ -314,32 +322,46 @@ static int[] computeValidTaskCounts(
 
   /**
    * Computes extra allowed increase in partitions-per-task in scenarios when the average per-partition lag
-   * is above the configured threshold. By default, it is {@code EXTRA_SCALING_ACTIVATION_LAG_THRESHOLD}.
-   * Generally, one of the autoscaler priorities is to keep the lag as close to zero as possible.
+   * is above the configured threshold.
+   * <p>
+   * This uses a logarithmic formula for consistent absolute growth:
+   * {@code deltaTasks = K * ln(lagSeverity)}
+   * where {@code K = (partitionCount / 6.4) / sqrt(currentTaskCount)}
+   * <p>
+   * This ensures that small taskCount's get a massive relative boost,
+   * while large taskCount's receive more measured, stable increases.
    */
-  static int computeExtraMaxPartitionsPerTaskIncrease(
+  static int computeExtraPPTIncrease(

Review Comment:
   Please simplify this method by returning the max allowed task count itself.
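
A sketch of what returning the task count directly might look like, built on the formula quoted in the javadoc above (`deltaTasks = K * ln(lagSeverity)`, `K = (partitionCount / 6.4) / sqrt(currentTaskCount)`). Several parts are assumptions, since the method body is not shown in this hunk: `lagSeverity` is taken to be the average per-partition lag divided by the threshold, the delta is rounded up, and below the threshold the sketch simply returns the current task count as a placeholder for the non-burst limit. The class and method names are hypothetical.

```java
// Hypothetical sketch; the lagSeverity definition, rounding, and the
// below-threshold behavior are assumptions, not the PR's logic.
final class MaxTaskCountSketch
{
  static int computeMaxAllowedTaskCount(
      double avgPartitionLag,
      int highLagThreshold,
      int partitionCount,
      int currentTaskCount,
      int taskCountMax
  )
  {
    // ASSUMPTION: severity is the ratio of observed lag to the threshold.
    final double lagSeverity = avgPartitionLag / highLagThreshold;
    if (lagSeverity <= 1.0) {
      // No burst scale-up; placeholder for the regular limit.
      return Math.min(currentTaskCount, taskCountMax);
    }

    // Formula from the javadoc: K shrinks as the current task count grows.
    final double k = (partitionCount / 6.4) / Math.sqrt(currentTaskCount);
    final int deltaTasks = (int) Math.ceil(k * Math.log(lagSeverity));

    // Return the task count itself, capped at the configured maximum.
    return Math.min(taskCountMax, currentTaskCount + deltaTasks);
  }
}
```

With this shape the caller can use the result as the upper end of the candidate range without converting an extra partitions-per-task increase back into a task count.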



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
