[
https://issues.apache.org/jira/browse/YUNIKORN-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055320#comment-18055320
]
Wilfred Spiegelenburg commented on YUNIKORN-3215:
-------------------------------------------------
Three main pain points seem to exist when we look at the performance:
* {{Clone()}} is called for the resources defined in the limit when walking
down the hierarchy. The clone is pure overhead: the resource pointed at is
never changed and is only used as an input in calculations. The clone can be
safely removed.
* The base for the child's limits is the set of limits defined on the parent.
New ones can be added, so the child's map is at least the size of the parent's
map. The map is created with the default size, which could lead to multiple
re-allocations just to fit the parent's entries. The map should be initialised
with the size of the parent's map to prevent that.
* The map that gets passed around keeps tracking each queue based on the path.
This means every queue is tracked until all queues are checked. After a level
is checked that specific level is no longer required and should be dropped.
Instead of tracking multiple levels via the double mapping
{{map[string]map[string]*resources.Resource}}, track a single level.
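The three changes above can be sketched roughly as follows. This is not the YuniKorn code: {{Resource}}, {{Queue}}, {{fitsIn}} and {{checkLimits}} are hypothetical, simplified stand-ins used only to illustrate sharing the resource without a clone, pre-sizing the map, and passing a single level down the tree.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for resources.Resource.
type Resource map[string]int64

// Queue is a hypothetical minimal queue config: name, limits and children.
type Queue struct {
	Name     string
	Limits   map[string]Resource // keyed by user or group name
	Children []*Queue
}

// fitsIn reports whether every quantity in child is <= the matching quantity
// in parent. Types missing from parent are treated as unlimited, which is an
// assumption made for this sketch only.
func fitsIn(parent, child Resource) bool {
	for name, qty := range child {
		if limit, ok := parent[name]; ok && qty > limit {
			return false
		}
	}
	return true
}

// checkLimits walks the hierarchy and hands only the limits that apply at the
// current level to the children, instead of a path-keyed map of all levels.
func checkLimits(q *Queue, parentLimits map[string]Resource) error {
	// Pre-size the map: it holds at least everything the parent has.
	current := make(map[string]Resource, len(parentLimits)+len(q.Limits))
	for name, res := range parentLimits {
		current[name] = res // shared, read-only: no clone needed
	}
	for name, res := range q.Limits {
		if parent, ok := parentLimits[name]; ok && !fitsIn(parent, res) {
			return fmt.Errorf("queue %s: limit for %s exceeds parent limit", q.Name, name)
		}
		current[name] = res
	}
	for _, child := range q.Children {
		if err := checkLimits(child, current); err != nil {
			return err
		}
	}
	// current is dropped on return, so a finished level is not kept around.
	return nil
}
```

Calling {{checkLimits(root, nil)}} starts the walk at the root. Each call only holds the map for its own level, which becomes garbage as soon as that subtree has been checked.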
Performance gain per change in a specific test case: a single tree 7 levels
deep, with 2000 leaves at level 7, measured with:
* no limits
* 2000 limits at level 6 (one level above the leaves)
The gains are similar when different trees and limit sets are used. Times are
normalised to the base case: no change, no limits, just queues. A value of 46
means it took 46 times the base time.
||change||no limits timed||with limits timed||
|no change|1|46.0|
|no clone|0.8|11.7|
|pre-alloc map|0.8|8.8|
|single level tracking|0.75|4.4|
The changes are a marked performance improvement: the with-limits case drops
from 46.0 times the base time to 4.4.
Some smaller improvements are added to reduce the load on the GC. These are
more difficult to measure as the GC work falls outside of the measured time.
> Improve performance of Limit checking
> -------------------------------------
>
> Key: YUNIKORN-3215
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3215
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Critical
>
> When adding large numbers of Limits (user and/or group quotas) to configured
> queues the validation code performance drops dramatically.
> The issue causing processing to slow down seems to be linked to the amount of
> garbage that gets generated. In small deployments with regular updates it
> pushes up the overall memory usage and could cause CPU starvation for the
> scheduler.
> A large number of queues (1500+ queues were tested) has limited to no real
> impact on the performance of the validations. Large numbers of limits set on
> non-leaf queues seem to have the biggest impact.
>