[ 
https://issues.apache.org/jira/browse/YUNIKORN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2212:
-----------------------------------
    Description: 
Every second, we collect "outstanding requests", i.e. those which cannot be 
scheduled.

The problem is that the scheduling cycle might not even have tried to schedule 
those pods. If that's the case, we mistakenly set them to "Unschedulable", 
which can trigger autoscaling if the cluster autoscaler happens to run at the 
beginning of the next scan interval.

Another thing to consider is when we need to mark them as Unschedulable. E.g. 
if tryPreemption() succeeded, do we still need new nodes? This can be 
addressed in a separate JIRA.

This issue also shows up during performance testing. Since we submit a lot of 
pods to YuniKorn, {{Scheduler.inspectOutstandingRequests()}} finds and 
collects them and subsequently generates a lot of API server updates. This is 
a special edge case, but something similar can happen on busy clusters.

*Ticket scope:*
- Mark asks once the scheduler has attempted to schedule them. Collect only 
those which are marked.
- Mark asks as soon as they've been updated to Unschedulable. Next time, don't 
collect these. Headrooms should still be calculated.
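
The two scope items could be sketched roughly as below. This is only an illustration, not the actual YuniKorn code: the {{Ask}} type, its {{SchedulingAttempted}} and {{ScaleUpTriggered}} fields, and the {{collectOutstanding}} and {{inspect}} helpers are all hypothetical names. Headroom calculation is left out; it would still run over every pending ask.

```go
package main

// Ask is a hypothetical, simplified stand-in for a pending allocation request.
type Ask struct {
	Key string
	// SchedulingAttempted is set by the scheduling cycle once it has tried
	// (and failed) to place this ask.
	SchedulingAttempted bool
	// ScaleUpTriggered is set once the ask has been reported as Unschedulable.
	ScaleUpTriggered bool
}

// collectOutstanding returns only the asks that the scheduler has already
// attempted and that have not been reported yet.
func collectOutstanding(pending []*Ask) []*Ask {
	var outstanding []*Ask
	for _, ask := range pending {
		if !ask.SchedulingAttempted || ask.ScaleUpTriggered {
			continue // never attempted, or already reported earlier
		}
		outstanding = append(outstanding, ask)
	}
	return outstanding
}

// inspect models one periodic inspection: collect the eligible asks, report
// each one through the callback (e.g. a pod status update), then mark it so
// the next inspection skips it. It returns the number of asks reported.
func inspect(pending []*Ask, report func(*Ask)) int {
	outstanding := collectOutstanding(pending)
	for _, ask := range outstanding {
		report(ask)
		ask.ScaleUpTriggered = true
	}
	return len(outstanding)
}
```

With this shape, an ask that the scheduling cycle has not looked at yet is never reported, and an ask that was already reported does not generate a second API server update on the next run.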


  was:
Every second, we collect "outstanding requests", i.e. those which cannot be 
scheduled.

The problem is that the scheduling cycle might not even have tried to schedule 
those pods. If that's the case, we mistakenly set them to "Unschedulable", 
which can trigger autoscaling if the cluster autoscaler happens to run at the 
beginning of the next scan interval.

Another thing to consider is when we need to mark them as Unschedulable. E.g. 
if tryPreemption() succeeded, do we still need new nodes? This can be 
addressed in a separate JIRA.

This issue also shows up during performance testing. Since we submit a lot of 
pods to YuniKorn, {{Scheduler.inspectOutstandingRequests()}} finds and 
collects them and subsequently generates a lot of API server updates. This is 
a special edge case, but something similar can happen on busy clusters.



> Don't collect requests that haven't been scheduled yet or have already 
> triggered scale-up
> -----------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2212
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2212
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: outstandingRequests-1.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
