Rainie Li created YUNIKORN-3007:
-----------------------------------

             Summary: Improve YuniKorn reservation logic
                 Key: YUNIKORN-3007
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3007
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - scheduler
            Reporter: Rainie Li
            Assignee: Rainie Li
         Attachments: queue.yaml, test-job1.yaml, test-job2.yaml, test-job3.yaml

*Issue and Investigation:* 

We've observed Spark job slowness on our production cluster, especially when 
large jobs are running on the cluster. This performance degradation impacts 
user experience.

Under high cluster utilization with numerous pending pods, large jobs that 
arrive first reserve resources on nodes. These reservations prevent newer 
jobs from getting the resources they need, which works against preemption.

*Test case:*

Please refer to the attached files. 
 # Submit test-job1 to queue-one
 # Once test-job1 is running, submit test-job2 to queue-two
 # Once test-job2 is running and pending memory exceeds 40TB, 
submit test-job3 to queue-three

*Proposal:*

YuniKorn incorporates multiple scenarios for making reservations. To address 
the current issue, we propose retaining only the preemption-related 
reservations, as preemption relies on reservations to ensure that resources can 
be reallocated later.

The rationale for removing other reservation scenarios is as follows:
 # If a queue's usage exceeds its guaranteed resources, it should not maintain 
reservations.
 # Conversely, if a queue's usage falls below its guaranteed resources, it 
should be able to secure resources through preemption.
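
As a rough illustration of the rule proposed above, here is a minimal sketch. It is not YuniKorn code: it is written in Python only for brevity, the names QueueState and next_step are hypothetical, the numbers are made up, and treating "usage equal to guaranteed" the same as "over guaranteed" is an assumption, since the proposal does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of a queue (memory in GiB, one resource only)."""
    name: str
    guaranteed: int
    used: int


def next_step(queue: QueueState, fits_now: bool) -> str:
    """Decide what to do with a pending ask under the proposed rule."""
    if fits_now:
        # A node has room right now: allocate, no reservation needed.
        return "allocate"
    if queue.used < queue.guaranteed:
        # Below guarantee: rely on preemption, keeping only the reservation
        # that preemption itself needs to hold the freed resources.
        return "preempt_and_reserve"
    # At or above guarantee (equality is an assumption): stay pending
    # without reserving a node.
    return "wait_unreserved"


if __name__ == "__main__":
    over = QueueState("queue-one", guaranteed=100, used=150)
    under = QueueState("queue-three", guaranteed=100, used=20)
    print(next_step(over, fits_now=False))
    print(next_step(under, fits_now=False))
```

The point of the sketch is that a reservation is only ever created on the preemption path, so a queue that is already over its guarantee can no longer pin nodes while it waits.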

*Our fix:* 

We applied a fix internally that removes the allocation case here: 
[https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L1532]

Reservation 
[https://yunikorn.apache.org/release-announce/0.8.0/#resource-reservation] 
appears to be by design, but in our case it works against preemption.

I would like to open this ticket for a follow-up discussion with the 
community on what a better solution to this issue would be. cc 
[~wilfreds] 


