[ 
https://issues.apache.org/jira/browse/YUNIKORN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758786#comment-17758786
 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-1934 at 8/25/23 1:16 AM:
--------------------------------------------------------------------------

gianluca perna  [16 hours 
ago|https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1692868430814269]
{quote}Hello everyone!
We are installing YuniKorn on our RKE cluster because we would like to manage 
Spark using the YuniKorn scheduler. We have installed and configured 
everything, which seems to be working for now, but we have some questions that 
hopefully can find answers here.
1) When we create queues, is it possible to define resources using percentages 
instead of fixed numbers? It would be helpful in case we scale up the cluster 
without having to reconfigure everything.
2) We have noticed that the only way we managed to distribute resources using 
preemption is as follows.
Let’s assume we have a resource pool of size 100.
User 1 arrives in the cluster, submits the Spark job, and takes all the 
resources.
Then comes User 2, submits the job, and preemption “takes away” the guaranteed 
resources defined for User 1 and assigns them to User 2.
Is there a way to use a 1/N policy? Meaning, when User 2 arrives, both get 50% 
of the resources, with User 3 getting 33% and so on?
3) Another question, to manage all our users with preemption, we created a 
“spark” queue under the root queue, and under this, a queue for each individual 
user. Guaranteed resources were assigned to each user, so that everyone could 
have some computing power in the worst case. However, we noticed that initially 
the mechanism was being rejected because in the spark queue we defined the max 
resources as the actual resources of our cluster, while the sum of the 
guaranteed resources per user was greater than the max resources of the spark 
queue. As a workaround, we set the maximum values of the spark queue to 
enormous values, so that the sum of guaranteed resources for users would never 
reach that limit. What would be the best practice? I am attaching a photo for 
the third point.
Thank you very much!
{quote}
!Screenshot 2023-08-24 at 11.07.28.png|width=424,height=100!
!Screenshot 2023-08-24 at 11.07.41.png|width=419,height=106!
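For reference, the hierarchy described in point 3 would look roughly like the 
following YuniKorn queues.yaml fragment. This is only a sketch: the user names 
and resource figures are illustrative (not taken from the attached screenshots), 
and the exact field names and value formats should be checked against the 
YuniKorn queue configuration documentation.

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: spark
            resources:
              max:
                memory: 400Gi    # illustrative: set to the real cluster size
                vcore: 100
            queues:
              - name: user-a
                resources:
                  guaranteed:
                    memory: 60Gi
                    vcore: 15
              - name: user-b
                resources:
                  guaranteed:
                    memory: 60Gi
                    vcore: 15
              # ... one child queue per user; the sum of the guaranteed
              # values exceeding spark's max is what triggers the rejection
```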
 
 
Wilfred Spiegelenburg
{quote}1: there is an open Jira for that, Rainie is working on it.
2: no, something like that was discussed but nothing has come of it yet as it 
was really complex to implement.
3: my recommendation: do not set a max size on the spark queue, the root queue 
will reflect the overall size of your cluster already. Using percentages in 
your setup with a fixed cluster would be nice. Make sure to add a comment to 
the Jira:
{quote}
{quote}https://issues.apache.org/jira/browse/YUNIKORN-1728
{quote}
 
gianluca perna 
{quote}Hi Wilfred, first of all thanks a lot for your hints and time.
About point 2, can I ask you why it is complicated? I mean, it seems to me that 
something similar is already implemented; the problem is just the amount of 
resources that you need to free.
To be clear, at the moment what preemption does is free an amount of resources 
equal to the “Guaranteed resources” specified for the user queue, so based on 
the values read from the config.
At this point, if the root queue knows about the total resources, why not use a 
counter of the active users and divide the total resources by that factor? In 
the end it is the same mechanism, just with a more aggressive preemption in 
that case.
About point 3, we tried to leave the spark queue without any value, but it 
seems that if you create a lot of queues (one per user in our case), and the 
sum of the guaranteed resources of all the inner queues is greater than the 
maximum cluster resources, the system refuses up front to add all the desired 
queues.
Maybe we are doing something wrong; I’ll run another test this afternoon. 
Thanks a lot!
{quote}
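The 1/N idea being asked about here can be sketched as follows. This is 
illustrative pseudologic only, not YuniKorn code; the function name and 
parameters are hypothetical, and a real implementation would also have to pick 
preemption victims, respect minimum allocation sizes, and handle multiple 
resource types at once.

```python
def fair_share(total_resources: int, active_users: int) -> int:
    """Equal 1/N split of a single resource dimension across active users.

    With a pool of 100: one user gets 100, two get 50 each,
    three get 33 each (integer division drops the remainder).
    """
    if active_users <= 0:
        raise ValueError("need at least one active user")
    return total_resources // active_users
```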
 
Wilfred Spiegelenburg
{quote}Guaranteeing more resources than are available in the cluster is a 
problem. At that point you create a state that might never become stable. 
Preemption would in a number of cases not be able to deliver that guarantee, 
which means that you really are back to just FIFO or priority-based scheduling.
With 2 it would always be based on queues. We do not guarantee resources for a 
user. That means the first assumption would be one queue per user. It would 
thus be based on active queues, not directly on users.
Just dividing the cluster into pieces like that could leave you with really 
small guarantees for each queue. Really small guarantees do not work, 
especially when the guarantee becomes as small as a single allocation. The 
other assumption you made is that there is nothing besides these user queues. 
If you have a mixed setup with some user-based queues and some mixed-load 
queues with guarantees it becomes complex.
The last point around percentages is that some resources, like GPUs, are low in 
count and not splittable. You don’t have GPUs in similar numbers as memory or 
CPU. Plus 1/6 of a GPU is not possible; it is either 0 or 1. Different types, 
different handling…
{quote}
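The GPU point is easy to see numerically. A sketch (function name and shape are 
mine, purely illustrative): a 1/6 percentage share of a large, divisible 
resource like memory is still usable, while the same share of a handful of 
indivisible GPUs rounds down to nothing.

```python
import math


def percentage_share(total: float, pct: float, divisible: bool) -> float:
    """Share of a resource under a percentage-based quota.

    Divisible resources (memory, CPU millicores) can take fractional
    shares; indivisible ones (GPUs) must round down to whole units.
    """
    raw = total * pct
    return raw if divisible else math.floor(raw)


# 1/6 of 512 units of memory is a usable ~85 units,
# but 1/6 of 4 GPUs rounds down to 0 whole GPUs.
```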
 
gianluca perna  
{quote}Understood, thank you. So basically, what would be the right approach in 
our case using YuniKorn?
I mean, we have hundreds of users in our cluster, who are clearly not active at 
the same time. So our idea was to split the guaranteed resources a little bit, 
but always in such a way that the sum of the guaranteed resources per user was 
greater than the total of the cluster. This is because it is generally a rare 
occurrence to see more than 30 percent of the users active at the same time.
Is it so wrong to create a queue per user?
Thanks a lot for your patience, your help is really appreciated!
{quote}
 
Sunil Govindan
{quote}[@Wilfred 
Spiegelenburg|https://yunikornworkspace.slack.com/team/ULRU2BU6B] can they use 
dynamic queues per user with a dynamic max capacity and a defined guaranteed 
capacity?
{quote}
Wilfred Spiegelenburg
{quote}I still think the % approach is the right way on a per-user queue.
I think we need to combine that with a minimum for the resulting value we 
calculate based on that %. The result should never be less than 20GB/1 CPU etc.
Exclude the percentage-based quota from the size check. We already do that 
implicitly when we use a child template. In that case we mostly circumvent the 
whole more-guaranteed-than-available case.
We might need to combine that with the limitation that we do not allow mixing 
of fixed and percentage-based values at the same level in the tree.
One other thing we can think of is allowing an oversubscribed guaranteed quota.
Capturing all this in a Jira.
{quote}
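The “% with a minimum” proposal amounts to clamping the percentage-derived 
value to an absolute floor. A minimal sketch (names are mine, not scheduler 
API; units are abstract):

```python
def guaranteed_quota(cluster_total: int, pct: float, floor: int) -> int:
    """Percentage-based guarantee clamped to a minimum absolute value.

    The guarantee scales with the cluster, but never shrinks below
    `floor`, so it stays at least as large as one usable allocation.
    """
    return max(int(cluster_total * pct), floor)


# 2% of a 4000-unit cluster is 80 units: below the floor of 100,
# so the floor wins. On a 10000-unit cluster the 2% (200) wins.
```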
Wilfred Spiegelenburg
{quote}One point I did not answer was the dynamic queue point from Sunil:
It would reduce the number of active queues and each queue would only exist for 
as long as there is a workload in that queue. That would help working around 
the check for the guaranteed total. A child template would need to be used. It 
supports guaranteed quotas.
That does imply using placement rules etc. as well.

The child template indirectly allows oversubscribing guaranteed. Not sure how 
the scheduler performs in the case that happens. Preemption most likely will 
not work in all cases as expected.
{quote}
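As far as I understand the placement-rule and child-template features, Sunil's 
dynamic-queue idea would look something like the fragment below. Rule and field 
names are from memory and should be verified against the YuniKorn placement-rule 
documentation; the guaranteed values are illustrative.

```yaml
partitions:
  - name: default
    placementrules:
      - name: user          # one dynamic queue per submitting user
        create: true
        parent:
          name: fixed
          value: root.spark
    queues:
      - name: root
        queues:
          - name: spark
            childtemplate:  # applied to each dynamically created queue
              resources:
                guaranteed:
                  memory: 20Gi
                  vcore: 1
```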



> Guaranteed quota distribution
> -----------------------------
>
>                 Key: YUNIKORN-1934
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1934
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Wilfred Spiegelenburg
>            Priority: Major
>         Attachments: Screenshot 2023-08-24 at 11.07.28.png, Screenshot 
> 2023-08-24 at 11.07.41.png
>
>
> Discussion on Slack around guaranteed quota distribution; full discussion in 
> the comments.
> Main points:
>  * percentage for guaranteed quota
>  * limitation of sum of guaranteed quota for queues to the cluster size when 
> not all queues are active
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
