[ 
https://issues.apache.org/jira/browse/YUNIKORN-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584485#comment-17584485
 ] 

Rainie Li commented on YUNIKORN-790:
------------------------------------

Hi [~wilfreds], I addressed your comments on 
[https://github.com/apache/yunikorn-core/pull/429] and put the updated change in 
[https://github.com/apache/yunikorn-core/pull/435].

Can you please review #435? I will abandon #429. 

*Summary:* 

I added a new method, *incAllocatingAcceptedAppsIfCanRun()*, because we 
increase runningApps when an application enters the _Starting_ state, and that 
logic is executed in a separate thread from the one where we check canRun.

During large-scale load testing we saw a concurrency issue: several apps can be 
accepted and pass the canRun check at the same time. After they are allocated 
resources, they are all added to runningApps, which can then exceed the 
MaxApplications value.
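
The race described above is a check-then-act problem. A minimal sketch of the 
combined check-and-increment idea follows, assuming a single mutex per queue. 
The type queueSketch, its fields, and the helper admitConcurrently are 
hypothetical names for illustration only and are not taken from yunikorn-core.

```go
package main

import (
	"fmt"
	"sync"
)

// queueSketch is a hypothetical stand-in for a scheduler queue; the field and
// method names are illustrative and not taken from yunikorn-core.
type queueSketch struct {
	mu          sync.Mutex
	maxApps     uint64 // MaxApplications limit for the queue
	runningApps uint64 // apps already counted as running
	allocating  uint64 // accepted apps that passed the check but are not running yet
}

// incAllocatingIfCanRun performs the limit check and the counter increment
// under one lock, so two callers cannot both pass the check for the last slot.
func (q *queueSketch) incAllocatingIfCanRun() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.runningApps+q.allocating >= q.maxApps {
		return false
	}
	q.allocating++
	return true
}

// markRunning moves one reserved slot into the running count; it models the
// state transition that happens later on a different goroutine.
func (q *queueSketch) markRunning() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.allocating--
	q.runningApps++
}

// admitConcurrently submits `apps` applications from separate goroutines and
// returns how many ended up counted as running.
func admitConcurrently(maxApps uint64, apps int) uint64 {
	q := &queueSketch{maxApps: maxApps}
	var wg sync.WaitGroup
	for i := 0; i < apps; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if q.incAllocatingIfCanRun() {
				q.markRunning()
			}
		}()
	}
	wg.Wait()
	return q.runningApps
}

func main() {
	fmt.Println("running apps:", admitConcurrently(12, 200))
}
```

Because the reservation counter is included in the limit check, apps that have 
passed the check but are not yet running still occupy a slot in this sketch.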

 

*Potential issue & proposed solution:* 

With my change, this concurrency issue can still happen in an extreme 
large-scale test: I configured 5 YuniKorn queues, each with maxApplication set 
to 12, then launched 1000 Spark jobs concurrently, 200 per queue. After the 
test had run for an hour, I saw runningApps exceed maxApplication.

To address this issue, I added an additional check: 
[https://github.com/apache/yunikorn-core/pull/435/files#diff-eeab7f6e1845e8d0ca68c02daa7151e8d68a5e7fc6cf4247b74632956d54f359R1201]

When we check canRun again in tryNode, we can guarantee that runningApps in 
each queue always stays within the maxApplication value. I have run several 
large-scale load tests to validate that it works. However, this additional 
check slows down the test, which is why I commented out this change in PR #435.
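
The idea of a second check at allocation time can be sketched as follows. This 
is only an illustration under assumptions: leafQueue, commitIfStillAllowed, and 
admitWithRecheck are hypothetical names, and the real tryNode logic in 
yunikorn-core is more involved. The early canRun result may be stale by the 
time resources are allocated, so the limit is checked again under the lock 
before runningApps is incremented.

```go
package main

import (
	"fmt"
	"sync"
)

// leafQueue is a hypothetical, simplified queue used only for this sketch.
type leafQueue struct {
	mu          sync.Mutex
	maxApps     int
	runningApps int
}

// canRun is the early, read-only check. Its answer can be stale by the time
// the application is actually allocated resources.
func (q *leafQueue) canRun() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.runningApps < q.maxApps
}

// commitIfStillAllowed models the second check at allocation time: the limit
// is re-validated and the counter incremented under the same lock.
func (q *leafQueue) commitIfStillAllowed() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.runningApps >= q.maxApps {
		return false
	}
	q.runningApps++
	return true
}

// admitWithRecheck lets every app pass the early check first (simulating the
// stale view), then commits them one by one with the second check.
// It returns how many apps passed the early check and how many are running.
func admitWithRecheck(maxApps, apps int) (passed, running int) {
	q := &leafQueue{maxApps: maxApps}
	for i := 0; i < apps; i++ {
		if q.canRun() {
			passed++
		}
	}
	for i := 0; i < passed; i++ {
		q.commitIfStillAllowed()
	}
	return passed, q.runningApps
}

func main() {
	passed, running := admitWithRecheck(12, 200)
	fmt.Println("passed early check:", passed, "running:", running)
}
```

In this sketch the second check adds one more lock acquisition per allocation 
attempt, which is one plausible source of the slowdown mentioned above.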

 

*Questions:* 

1. Is this concurrency issue acceptable in the extreme large-scale load case? 
(We will not have the same extreme case in our production environment; I am 
not sure about others.)

2. If 1 is not acceptable, is there any other suggestion for addressing this 
concurrency issue besides the second canRun check in the tryNode method?

 

[~pbacsko] I tested in a real k8s cluster with 85 nodes, not minikube. 

> Implement MaxApplications enforcement
> -------------------------------------
>
>                 Key: YUNIKORN-790
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-790
>             Project: Apache YuniKorn
>          Issue Type: New Feature
>          Components: core - scheduler
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Rainie Li
>            Priority: Major
>              Labels: pull-request-available
>
> Queues have an option to set the MaxApplications that can run in a queue. 
> There is currently no code in the scheduler that checks this setting.
> As a new feature we should add the enforcement for this setting:
>  * enforce the setting on a leaf queue
>  * enforce the setting on a parent, the apps running in a parent queue is 
> defined as the sum of all the apps running in all leaf queues of the parent.
> As a side note from a config check: we need to make sure that the parent 
> setting cannot be lower than any of the child queues it has. We _must not_ 
> enforce that the parent setting must be larger than sum of all leafs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
