[
https://issues.apache.org/jira/browse/YUNIKORN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786256#comment-17786256
]
Craig Condit edited comment on YUNIKORN-2156 at 11/15/23 9:27 AM:
------------------------------------------------------------------
I’m not sure this is a good idea. For one thing, we don’t want any app-specific
code in the core. We definitely should not be messing with priorities for
different pods within an application. An app may be scheduled as a gang to
enforce a minimum viable number of pods for the app to function properly, but
at some later point demand may spike and additional pods may be needed quite
quickly (the app may submit high-priority pods). Even within Spark this could
happen in the case of Spark streaming from something like a Kafka topic that
has its partition count expanded due to increasing load.
Quota should also not impact this - it’s completely valid for users to schedule
more tasks than they have quota for. Quota is also per-queue and not per-app, so
adding logic like this just doesn’t make sense.
These considerations really need to happen within the application framework
that submits pods (in this case Spark). Only the application itself knows what
priorities should be used or how to handle the case where pods have been
requested but not received in a timely manner.
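For context, the gang itself is already declared by the submitting framework, not
the core: YuniKorn reads task-group definitions from pod annotations. A minimal
sketch of such a pod spec follows (the application id, group name, and sizes are
illustrative, not taken from this issue):

```yaml
# Illustrative Spark executor pod using YuniKorn gang scheduling annotations.
# "minMember" is the minimum viable number of executors the gang reserves;
# pods submitted beyond it (e.g. by dynamic allocation) fall outside the gang.
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-1                       # example name
  labels:
    applicationId: spark-app-0001          # example application id
    queue: root.spark                      # example queue
  annotations:
    yunikorn.apache.org/task-group-name: spark-executors
    yunikorn.apache.org/task-groups: |
      [{
        "name": "spark-executors",
        "minMember": 3,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      }]
spec:
  schedulerName: yunikorn
  containers:
    - name: executor
      image: apache/spark:3.5.0            # example image
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
```

Since the annotations are produced by the framework submitting the pods, this is
also where per-pod priority decisions would naturally live, per the argument above.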
> Improvement for gang scheduling when spark dynamic allocation enabled
> ---------------------------------------------------------------------
>
> Key: YUNIKORN-2156
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2156
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Qi Zhu
> Priority: Major
>
> Try to improve the case where Spark Dynamic Allocation is used with gang
> scheduling enabled:
> Could we add some improvements for this case? For example:
> *For the preemption case:*
> Lower the priority of any pods beyond the gang, so that they are killed
> earlier than they would be if they did not belong to a gang.
> *For the quota-hit case:*
> For normal gang scheduling we reject jobs whose total gang size exceeds the
> quota, but for dynamic allocation, could we add monitoring logic to track the
> resources those jobs allocate beyond the quota?
> * We might extend the quota by some amount when that happens?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)