jtuglu1 commented on code in PR #19091:
URL: https://github.com/apache/druid/pull/19091#discussion_r2907108205

##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java:
##########
@@ -62,10 +63,17 @@ public class CostBasedAutoScaler implements SupervisorTaskAutoScaler
   public static final String LAG_COST_METRIC = "task/autoScaler/costBased/lagCost";
   public static final String IDLE_COST_METRIC = "task/autoScaler/costBased/idleCost";
   public static final String OPTIMAL_TASK_COUNT_METRIC = "task/autoScaler/costBased/optimalTaskCount";
+  public static final String INVALID_METRICS_COUNT = "task/autoScaler/costBased/invalidMetrics";
   static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2;
   static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
+  /**
+   * If average partition lag crosses this value and the processing rate is
+   * still zero, scaling actions are skipped and an alert is raised.
+   */
+  static final int MAX_IDLENESS_PARTITION_LAG = 10_000;

Review Comment:
   > But if the lag exceeds this value AND processing rate is zero, that indicates something is wrong with the tasks.

   I guess my point is we have topics where exceeding 10k is probably too late to detect something is up (we've already broken an SLO). We can leave it for now to avoid config bloat, but I don't really like to hardcode this stuff. IMO, when we start to add more tweakable configs/magic numbers to the solution it points at a larger underlying issue.
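
   For illustration only, here is a rough sketch of what a non-hardcoded version of this check might look like. Everything in it is hypothetical: `IdleLagGuard`, its constructor argument, and `shouldSkipScaling` are not part of the PR or of Druid, and the nullable override is only assumed to come from some per-supervisor config field. The default mirrors the `MAX_IDLENESS_PARTITION_LAG` value from the diff.

```java
/**
 * Hypothetical sketch, not the code in the PR: the "lag is high but nothing is
 * being processed" check with the lag threshold injected instead of hardcoded.
 */
final class IdleLagGuard
{
  /** Default mirrors MAX_IDLENESS_PARTITION_LAG from the diff. */
  static final long DEFAULT_MAX_IDLENESS_PARTITION_LAG = 10_000L;

  private final long maxIdlenessPartitionLag;

  /**
   * @param configuredThreshold optional override, assumed to come from a
   *                            hypothetical per-supervisor config field;
   *                            null falls back to the default
   */
  IdleLagGuard(Long configuredThreshold)
  {
    final long threshold = configuredThreshold == null
                           ? DEFAULT_MAX_IDLENESS_PARTITION_LAG
                           : configuredThreshold;
    if (threshold < 0) {
      throw new IllegalArgumentException("threshold must be non-negative");
    }
    this.maxIdlenessPartitionLag = threshold;
  }

  /**
   * Returns true when average partition lag is above the threshold while the
   * processing rate is zero (or negative), i.e. the case where scaling would
   * be skipped and an alert raised.
   */
  boolean shouldSkipScaling(double avgPartitionLag, double processingRate)
  {
    return avgPartitionLag > maxIdlenessPartitionLag && processingRate <= 0.0;
  }
}
```

   With something like this, a latency-sensitive topic could pass a much lower threshold (for example `new IdleLagGuard(500L)`) while every other supervisor keeps the 10k default via `new IdleLagGuard(null)`. It does add one more knob, which is the config bloat trade-off mentioned above.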
