[
https://issues.apache.org/jira/browse/YUNIKORN-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921450#comment-17921450
]
Craig Condit commented on YUNIKORN-3007:
----------------------------------------
I disagree 100% with removing reservations. However, I think there is room for
some design work on improving how they work and providing ways to customize
that behavior. I propose we use this Jira issue to have a discussion on that.
> Improve YuniKorn reservation logic
> ----------------------------------
>
> Key: YUNIKORN-3007
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3007
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Rainie Li
> Assignee: Rainie Li
> Priority: Major
> Attachments: queue.yaml, test-job1.yaml, test-job2.yaml,
> test-job3.yaml
>
>
> *Issue and Investigation:*
> We’ve observed Spark job slowness on our prod cluster, especially when
> large jobs are running on the cluster. This performance degradation impacts
> user experience.
> When cluster utilization is high and many pods are pending, large jobs that
> arrive first can reserve resources on nodes. These reservations prevent
> newer jobs from getting the resources they need, which works against
> preemption.
> *Test case:*
> Please refer to the attached files.
> # Submit test-job1 to queue-one
> # Once test-job1 is running, submit test-job2 to queue-two
> # Once test-job2 is running and pending memory exceeds 40TB,
> submit test-job3 to queue-three
> *Proposal:*
> YuniKorn incorporates multiple scenarios for making reservations. To address
> the current issue, we propose retaining only the preemption-related
> reservations, as preemption relies on reservations to ensure that resources
> can be reallocated later.
> The rationale for removing other reservation scenarios is as follows:
> # If a queue's usage exceeds its guaranteed resources, it should not
> maintain reservations.
> # Conversely, if a queue's usage falls below its guaranteed resources, it
> should be able to secure resources through preemption.
> *Our fix:*
> We applied a fix internally that removes the allocation case here:
> [https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/application.go#L1532]
>
>
> Reservation
> [https://yunikorn.apache.org/release-announce/0.8.0/#resource-reservation]
> appears to be by design, but in our case it works against preemption.
> I would like to open this ticket to have a follow-up discussion with the
> community on the best way to address this issue. cc
> [~wilfreds]
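
For discussion purposes, the two rationale points in the quoted proposal can be
read as a simple decision rule. The Go sketch below is only an illustration of
that reading, not YuniKorn code: the Decision type, its constants, and the
decide function are hypothetical names, and the sketch assumes a single
resource dimension (for example memory) expressed as an integer. The proposal
does not say what happens when usage equals the guarantee; the sketch treats
that case as "not below".

```go
package main

import "fmt"

// Decision is a hypothetical outcome type used only in this sketch.
type Decision int

const (
	// NoReservation: the ask waits for free resources without holding a node.
	NoReservation Decision = iota
	// RelyOnPreemption: the ask is expected to obtain resources through
	// preemption, which keeps its own preemption-related reservation.
	RelyOnPreemption
)

// String returns a readable label for the decision.
func (d Decision) String() string {
	if d == RelyOnPreemption {
		return "rely on preemption"
	}
	return "no reservation"
}

// decide applies the two rationale points to one resource dimension.
// Assumption: usage equal to the guarantee is treated as "not below".
func decide(usage, guaranteed int64) Decision {
	if usage < guaranteed {
		return RelyOnPreemption
	}
	return NoReservation
}

func main() {
	// Queue above its guarantee: should not maintain reservations.
	fmt.Println(decide(50, 40))
	// Queue below its guarantee: should secure resources via preemption.
	fmt.Println(decide(10, 40))
}
```

A real scheduler compares multi-dimensional resources and has more reservation
scenarios than this, so the sketch only captures the shape of the argument.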