Rainie Li created YUNIKORN-3007:
-----------------------------------
Summary: Improve YuniKorn reservation logic
Key: YUNIKORN-3007
URL: https://issues.apache.org/jira/browse/YUNIKORN-3007
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - scheduler
Reporter: Rainie Li
Assignee: Rainie Li
Attachments: queue.yaml, test-job1.yaml, test-job2.yaml, test-job3.yaml
*Issue and Investigation:*
We've observed Spark job slowness on our prod cluster, especially when
large jobs are running on the cluster. This performance degradation impacts
user experience.
When cluster utilization is high and many pods are pending, large jobs that
arrive first can reserve resources on nodes. These reservations prevent new
jobs from getting the resources they need, which works against preemption.
*Test case:*
Please refer to the attached files.
# Submit test-job1 to queue-one
# Once test-job1 is running, submit test-job2 to queue-two
# Once test-job2 is running and pending memory exceeds 40 TB,
submit test-job3 to queue-three
*Proposal:*
YuniKorn incorporates multiple scenarios for making reservations. To address
the current issue, we propose retaining only the preemption-related
reservations, as preemption relies on reservations to ensure that resources can
be reallocated later.
The rationale for removing other reservation scenarios is as follows:
# If a queue's usage exceeds its guaranteed resources, it should not maintain
reservations.
# Conversely, if a queue's usage falls below its guaranteed resources, it
should be able to secure resources through preemption.
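The rationale above can be sketched as a small decision function. This is only an illustration of the proposed policy, not YuniKorn code: the `Reason`, `Queue`, `AllowReservation`, and `NextStep` names are hypothetical, resources are reduced to a single memory value, the numbers are made up, and treating usage equal to the guarantee as "wait" is an assumption the proposal does not spell out.

```go
package main

import "fmt"

// Reason is a hypothetical label for why the scheduler wants a reservation.
type Reason int

const (
	// AllocationFailed: the ask did not fit on any node in this cycle.
	AllocationFailed Reason = iota
	// Preemption: victims were selected and the freed space must be held.
	Preemption
)

// Queue is a simplified, single-resource stand-in for a scheduler queue.
type Queue struct {
	Name       string
	Used       int64 // memory in use, in GiB (illustrative values only)
	Guaranteed int64 // guaranteed memory, in GiB
}

// AllowReservation keeps only preemption-related reservations,
// as proposed in this ticket.
func AllowReservation(r Reason) bool {
	return r == Preemption
}

// NextStep describes what a pending ask falls back to when an ordinary
// allocation fails and no reservation is made.
func NextStep(q Queue) string {
	if q.Used < q.Guaranteed {
		// Below guarantee: the queue can secure resources through preemption.
		return "preempt"
	}
	// At or above guarantee (assumption for the equal case): hold nothing.
	return "wait"
}

func main() {
	queues := []Queue{
		{Name: "queue-one", Used: 900, Guaranteed: 500},
		{Name: "queue-three", Used: 100, Guaranteed: 500},
	}
	for _, q := range queues {
		fmt.Printf("%s: reserve on failed allocation=%v, next step=%s\n",
			q.Name, AllowReservation(AllocationFailed), NextStep(q))
	}
}
```

Under this sketch, a failed allocation never leaves a reservation behind, so an over-guarantee queue cannot pin nodes while an under-guarantee queue is waiting.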
*Our fix:*
We applied a fix internally that removes the reservation in the allocation
case here:
[https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L1532]
Reservation
[https://yunikorn.apache.org/release-announce/0.8.0/#resource-reservation]
appears to be by design, but in our case it works against preemption.
I would like to use this ticket for a follow-up discussion with the
community on what a better solution to this issue would be. cc
[~wilfreds]