[ 
https://issues.apache.org/jira/browse/YUNIKORN-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921450#comment-17921450
 ] 

Craig Condit commented on YUNIKORN-3007:
----------------------------------------

I disagree 100% with removing reservations. However, I think there is room for 
some design work on improving how they work and providing ways to customize 
that behavior. I propose we use this Jira issue to have a discussion on that.

> Improve YuniKorn reservation logic
> ----------------------------------
>
>                 Key: YUNIKORN-3007
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3007
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Rainie Li
>            Assignee: Rainie Li
>            Priority: Major
>         Attachments: queue.yaml, test-job1.yaml, test-job2.yaml, 
> test-job3.yaml
>
>
> *Issue and Investigation:*
> We’ve observed Spark job slowness on our prod cluster, especially when 
> large jobs are running on the cluster. This performance degradation impacts 
> user experience.
> When cluster utilization is high and there are numerous pending pods, large 
> jobs that arrive first can reserve resources on nodes. This reservation 
> mechanism prevents new jobs from getting the resources they need, which 
> works against preemption.
> *Test case:*
> Please refer to the attached files.
>  # Submit test-job1 to queue-one
>  # Once test-job1 is running, submit test-job2 to queue-two
>  # Once test-job2 is running and pending memory exceeds 40TB, 
> submit test-job3 to queue-three
> *Proposal:*
> YuniKorn incorporates multiple scenarios for making reservations. To address 
> the current issue, we propose retaining only the preemption-related 
> reservations, as preemption relies on reservations to ensure that resources 
> can be reallocated later.
> The rationale for removing other reservation scenarios is as follows:
>  # If a queue's usage exceeds its guaranteed resources, it should not 
> maintain reservations.
>  # Conversely, if a queue's usage falls below its guaranteed resources, it 
> should be able to secure resources through preemption.
> *Our fix:* 
> We applied a fix internally that removes the allocation case here: 
> [https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L1532]
>  
>  
> Reservation 
> [https://yunikorn.apache.org/release-announce/0.8.0/#resource-reservation] 
> appears to be by design, but in our case it works against preemption.
> I would like to open this ticket for a follow-up discussion with the 
> community on a better way to address this issue. cc 
> [~wilfreds] 
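
For discussion purposes, the two rules in the quoted proposal can be sketched 
in Go as a single decision function. This is a minimal illustration only, not 
YuniKorn code: `QueueState`, `Action`, and `decide` are hypothetical names, 
resources are reduced to a single integer dimension, and treating usage equal 
to the guarantee as "no reservation" is an assumption the proposal does not 
spell out.

```go
package main

import "fmt"

// QueueState is a hypothetical snapshot of one queue. It is not a YuniKorn
// type; resources are collapsed to a single dimension (e.g. memory in GiB).
type QueueState struct {
	Name       string
	Used       int64 // resources currently allocated to the queue
	Guaranteed int64 // guaranteed resources configured for the queue
}

// Action is what this sketch does with an ask that cannot be placed right now.
type Action string

const (
	// ActionSkip: make no reservation and retry in a later scheduling cycle.
	ActionSkip Action = "skip"
	// ActionPreempt: rely on preemption, keeping only the
	// preemption-related reservation for the freed resources.
	ActionPreempt Action = "preempt-and-reserve"
)

// decide applies the two rules from the proposal:
//  1. usage above the guarantee -> the queue holds no reservations;
//  2. usage below the guarantee -> the queue obtains resources via preemption.
//
// Usage exactly equal to the guarantee is treated like rule 1 (assumption).
func decide(q QueueState) Action {
	if q.Used < q.Guaranteed {
		return ActionPreempt
	}
	return ActionSkip
}

func main() {
	// Illustrative numbers only; they do not come from the attached test jobs.
	queues := []QueueState{
		{Name: "queue-one", Used: 50, Guaranteed: 40},
		{Name: "queue-two", Used: 40, Guaranteed: 40},
		{Name: "queue-three", Used: 10, Guaranteed: 40},
	}
	for _, q := range queues {
		fmt.Printf("%s: %s\n", q.Name, decide(q))
	}
}
```

The sketch deliberately leaves out node selection and victim selection; it 
only shows where the proposal draws the line between queues that may still 
hold a reservation and queues that may not.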



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
