[
https://issues.apache.org/jira/browse/YUNIKORN-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584485#comment-17584485
]
Rainie Li commented on YUNIKORN-790:
------------------------------------
Hi [~wilfreds], I addressed your comments on
[https://github.com/apache/yunikorn-core/pull/429] and put the updated change in
[https://github.com/apache/yunikorn-core/pull/435].
Can you please review #435? I will abandon #429.
*Summary:*
The reason I added a new method *incAllocatingAcceptedAppsIfCanRun()* is that we
increase runningApps when an application enters the _Starting_ state, and this
logic is executed in a separate thread from the one where we check canRun.
During large scale load testing, we saw a concurrency issue: several apps can be
accepted and pass the canRun check at the same time. After they are allocated
resources, they are all added to runningApps, which can then exceed the
MaxApplications value.
*Potential issue & proposed solution:*
With my change, this concurrency issue can still happen in an extreme case of
large scale testing: I set up 5 YuniKorn queues, each with maxApplication set
to 12. Then I launched 1000 Spark jobs concurrently, 200 jobs per queue. After
the test had run for 1 hour, I saw runningApps exceed maxApplication.
To address this issue, I added an additional check:
[https://github.com/apache/yunikorn-core/pull/435/files#diff-eeab7f6e1845e8d0ca68c02daa7151e8d68a5e7fc6cf4247b74632956d54f359R1201]
When we check canRun again in tryNode, we can guarantee that runningApps in each
queue is always limited to the maxApplication value. I have run several large
scale load tests to validate that it works. However, this additional check
slows down the test. That's why I commented this change in PR #435.
*Questions:*
1. Is this concurrency issue acceptable in the extreme large scale load case?
(We will not have the same extreme case in our production environment; I am not
sure about others.)
2. If 1 is not acceptable, are there any other suggestions to address this
concurrency issue besides the second canRun check in the tryNode method?
[~pbacsko] I tested in a real k8s cluster with 85 nodes, not minikube.
> Implement MaxApplications enforcement
> -------------------------------------
>
> Key: YUNIKORN-790
> URL: https://issues.apache.org/jira/browse/YUNIKORN-790
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - scheduler
> Reporter: Wilfred Spiegelenburg
> Assignee: Rainie Li
> Priority: Major
> Labels: pull-request-available
>
> Queues have an option to set the MaxApplications that can run in a queue.
> There is currently no code in the scheduler that checks this setting.
> As a new feature we should add the enforcement for this setting:
> * enforce the setting on a leaf queue
> * enforce the setting on a parent: the number of apps running in a parent
> queue is defined as the sum of all the apps running in all leaf queues of the parent.
> As a side note from a config check: we need to make sure that the parent
> setting cannot be lower than any of the child queues it has. We _must not_
> enforce that the parent setting must be larger than sum of all leafs.
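The config check in the side note above could be sketched roughly as follows. This is only an illustration under stated assumptions: queueConfig and validateMaxApps are hypothetical names, not the YuniKorn config types, and treating a maxApps value of 0 as "no limit set" is an assumption of this sketch.

```go
package main

import "fmt"

// queueConfig is a hypothetical, simplified queue definition used only in this sketch.
type queueConfig struct {
	name     string
	maxApps  int // assumption of this sketch: 0 means no limit is set
	children []queueConfig
}

// validateMaxApps rejects a parent whose limit is lower than the limit of any
// of its children. It deliberately does not compare the parent limit with the
// sum of the child limits.
func validateMaxApps(q queueConfig) error {
	for _, c := range q.children {
		if q.maxApps > 0 && c.maxApps > q.maxApps {
			return fmt.Errorf("queue %s: maxApps %d is lower than child %s maxApps %d",
				q.name, q.maxApps, c.name, c.maxApps)
		}
		if err := validateMaxApps(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The parent limit equals each child limit but is below their sum: accepted.
	ok := queueConfig{name: "root", maxApps: 12, children: []queueConfig{
		{name: "a", maxApps: 12},
		{name: "b", maxApps: 12},
	}}
	fmt.Println(validateMaxApps(ok))

	// The parent limit is lower than one child limit: rejected.
	bad := queueConfig{name: "root", maxApps: 5, children: []queueConfig{
		{name: "a", maxApps: 12},
	}}
	fmt.Println(validateMaxApps(bad))
}
```

The first call returns nil and the second returns an error naming the parent and the offending child.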