Fly-Style commented on code in PR #18936:
URL: https://github.com/apache/druid/pull/18936#discussion_r2721936785


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/WeightedCostFunction.java:
##########
@@ -106,10 +110,14 @@ public CostResult computeCost(CostMetrics metrics, int proposedTaskCount, CostBa
 
 
   /**
-   * Estimates the idle ratio for a given task count using a capacity-based linear model.
+   * Estimates the idle ratio for a proposed task count.
+   * Includes lag-based adjustment to eliminate high lag and

Review Comment:
   > Also, in addition to the above, I think adding in this lag consideration 
does add some complexity here. Mainly it generally starts us down the path of 
making the cost function harder to easily and quickly understand for a 
newcomer, IMO. 
   
   Sometimes complexity in a project is justified because it makes some things 
work slightly better. A good example is the query planner / query optimizer we 
get from Calcite: it is not easy to get into and hard to master, but that 
complexity provides a good framework for supporting SQL in your database.
   The same applies here: to make supervisor autoscaling work well, we need to 
introduce a level of complexity backed by math (the formulas are described here: 
https://github.com/apache/druid/pull/18819). During testing, I realized the 
cost function is too conservative in high-lag scenarios, and this change is an 
attempt to tweak it a bit.
   
   _Picking up your question from the general comment:_
   > I also wonder if a more in depth technical writeup once this feature is 
stabilized is in order. Something that explains the method to the madness and a 
bit of the math. Perhaps in a brief blog post or docs page within the apache 
Druid website?
   
   We should definitely do that, but the feature is not fully stabilized yet. 
That said, it already has a decent foundation.
   
   > So I guess that begs the question, how did we or are we going to measure 
the improvement that this additional logic/computation provides?
   
   That's a very good question. I would answer it this way: the less time we 
spend scaling supervisors manually or fine-tuning an autoscaler, the better 
the result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

