maytasm opened a new pull request #11440:
URL: https://github.com/apache/druid/pull/11440


   Improve the autoscaler's pendingTaskBased provisioning strategy to better 
handle the case where no worker nodes are currently running
   
   ### Description
   
   As described in https://github.com/apache/druid/issues/10918, the 
PendingTaskBasedWorkerProvisioningStrategy of the autoscaler does not work 
well when no worker nodes are running. The problems are the following:
   1. When no worker nodes are running, the autoscaler first scales up to 
minWorkerCount, and only in the next provisioning cycle can it determine the 
correct number of workers needed to run all pending tasks. This is inefficient: 
it takes two provisioning cycles, plus the time for the worker nodes from the 
first cycle to come up, before the autoscaler can scale to the correct number 
(roughly twice as long as needed).
   2. When minWorkerCount is set to 0 and no worker nodes are running, the 
autoscaler never attempts to add instances, because it tries to scale to 
minWorkerCount (which is 0). Hence, pending tasks are never able to run.
   
   The autoscaler scales to minWorkerCount first because, without any running 
worker node, it cannot determine the capacity per worker. (Note that even when 
worker nodes are running, the autoscaler assumes that all worker nodes have the 
same capacity and uses the capacity of the first running node.)
   
   To fix this problem, I introduce a new config in 
PendingTaskBasedWorkerProvisioningConfig, 
`druid.indexer.autoscale.workerCapacityFallback`. This config tells the 
autoscaler what worker capacity to assume when determining the number of 
workers needed while no workers are running. If unset or null, the autoscaler 
scales to `minNumWorkers` in the autoScaler config instead. Note: this config 
is only applicable to the `pendingTaskBased` provisioning strategy. Even if 
this config value is not accurate (e.g. if your worker node capacity changed 
over time), it is still useful for solving problem #2 above, as the autoscaler 
will at least provision some nodes and, in the next provisioning cycle, will be 
able to determine the correct number of workers needed (rather than being stuck 
at 0 workers forever).
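
   To illustrate the intended behavior, here is a minimal sketch of the 
decision described above. It is not the code in this PR: the class name, 
method name, and parameters are hypothetical, and clamping to a 
`maxNumWorkers` upper bound is an assumption added only to keep the example 
bounded.

```java
// Hypothetical sketch, not Druid's actual implementation.
final class ProvisioningSketch {

    /**
     * Returns the worker count to scale to.
     *
     * @param pendingTasks            number of pending tasks
     * @param firstWorkerCapacity     capacity of the first running worker, or null if none is running
     * @param workerCapacityFallback  value of the new config, or null if unset
     * @param minNumWorkers           lower bound from the autoScaler config
     * @param maxNumWorkers           upper bound (assumed here for illustration)
     */
    static int targetWorkers(
            int pendingTasks,
            Integer firstWorkerCapacity,
            Integer workerCapacityFallback,
            int minNumWorkers,
            int maxNumWorkers) {
        final int capacity;
        if (firstWorkerCapacity != null) {
            // All workers are assumed to share the first running worker's capacity.
            capacity = firstWorkerCapacity;
        } else if (workerCapacityFallback != null && workerCapacityFallback > 0) {
            // No running worker: use the configured fallback capacity.
            capacity = workerCapacityFallback;
        } else {
            // No running worker and no fallback: scale to the minimum only.
            return minNumWorkers;
        }
        // Ceiling division: workers needed to hold every pending task.
        int needed = (pendingTasks + capacity - 1) / capacity;
        return Math.max(minNumWorkers, Math.min(maxNumWorkers, needed));
    }

    public static void main(String[] args) {
        // 10 pending tasks, no running workers, fallback capacity 4.
        System.out.println(targetWorkers(10, null, 4, 0, 20));    // prints 3
        // Same situation without the fallback: stuck at minNumWorkers = 0.
        System.out.println(targetWorkers(10, null, null, 0, 20)); // prints 0
        // A running worker's capacity takes precedence over the fallback.
        System.out.println(targetWorkers(10, 2, 4, 0, 20));       // prints 5
    }
}
```

   The second call shows problem #2: with no running worker and no fallback, 
the target stays at 0 regardless of how many tasks are pending.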
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


