[ https://issues.apache.org/jira/browse/MESOS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914674#comment-16914674 ]
Meng Zhu commented on MESOS-9806:
---------------------------------
As of now, allocator performance is close to that of 1.8.1, even with the
addition of limits enforcement. There will be more improvement as we deprecate
the framework sorter and optimize the role sorter (MESOS-9942 and MESOS-9943).
> Address allocator performance regression due to the addition of quota limits.
> -----------------------------------------------------------------------------
>
> Key: MESOS-9806
> URL: https://issues.apache.org/jira/browse/MESOS-9806
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
> Reporter: Meng Zhu
> Assignee: Meng Zhu
> Priority: Critical
> Labels: resource-management
>
> In MESOS-9802, we removed the quota role sorter, which was tech debt.
> However, this slows down the allocator. The problem is that in the first
> stage, even if a cluster has no active roles with non-default quota, the
> allocator now has to sort and go through every role in the cluster.
> Benchmark results show that for 1k roles with 2k frameworks, the allocator
> could experience ~50% performance degradation.
> There are a couple of ways to address this issue. For example, we could make
> the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns
> all the roles with non-default quota. Alternatively, an even better approach
> would be to deprecate the sorter concept and just have two standalone
> functions, e.g. `sortRoles()` and `sortQuotaRoles()`, that take in the role
> tree structure (which does not yet exist in the allocator) and return the
> sorted roles.
> In addition, with the implementation of MESOS-8068, we need to do more during
> the allocation cycle. In particular, we need to call `shrink` many more times
> than before. All of this contributes to the performance slowdown.
> Specifically, for the quota-oriented benchmark
> `HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2` we can observe a
> 2-3x slowdown compared to the previous release (1.8.1):
> Current master:
> QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter
> Made 3500 allocations in 32.051382735secs
> Made 0 allocation in 27.976022773secs
> 1.8.1:
> HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
> Made 3500 allocations in 13.810811063secs
> Made 0 allocation in 9.885972984secs
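
For illustration, the two standalone functions suggested in the description could look roughly like the sketch below. Everything in it is an assumption rather than Mesos code: `RoleNode`, its `share` and `hasNonDefaultQuota` fields, and the `addChild` helper are hypothetical stand-ins for a role tree that the description says does not yet exist in the allocator, and the flat sort by share is a simplification of what a real (e.g. DRF, per-level) sorter would do.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical role tree node; NOT the Mesos allocator's actual data structure.
struct RoleNode {
  std::string name;
  double share = 0.0;               // Stand-in sort key (e.g. a DRF share).
  bool hasNonDefaultQuota = false;  // True if a non-default quota is set.
  std::vector<std::unique_ptr<RoleNode>> children;
};

// Hypothetical helper for building a tree; returns the newly added child.
RoleNode& addChild(RoleNode& parent, std::string name, double share, bool quota)
{
  auto child = std::make_unique<RoleNode>();
  child->name = std::move(name);
  child->share = share;
  child->hasNonDefaultQuota = quota;
  parent.children.push_back(std::move(child));
  return *parent.children.back();
}

namespace internal {

// Depth-first walk that collects either every role or only quota'ed roles.
void collect(
    const RoleNode& node, bool quotaOnly, std::vector<const RoleNode*>& out)
{
  if (!quotaOnly || node.hasNonDefaultQuota) {
    out.push_back(&node);
  }
  for (const auto& child : node.children) {
    collect(*child, quotaOnly, out);
  }
}

// Collects roles below the (synthetic) root and sorts them by ascending
// share, breaking ties by name so the order is deterministic.
std::vector<std::string> sorted(const RoleNode& root, bool quotaOnly)
{
  std::vector<const RoleNode*> nodes;
  for (const auto& child : root.children) {
    collect(*child, quotaOnly, nodes);
  }

  std::sort(
      nodes.begin(),
      nodes.end(),
      [](const RoleNode* a, const RoleNode* b) {
        if (a->share != b->share) {
          return a->share < b->share;
        }
        return a->name < b->name;
      });

  std::vector<std::string> names;
  for (const RoleNode* node : nodes) {
    names.push_back(node->name);
  }
  return names;
}

} // namespace internal {

// Returns all roles in sorted order.
std::vector<std::string> sortRoles(const RoleNode& root)
{
  return internal::sorted(root, false);
}

// Returns only the roles with non-default quota, in sorted order.
std::vector<std::string> sortQuotaRoles(const RoleNode& root)
{
  return internal::sorted(root, true);
}
```

In this sketch `sortQuotaRoles()` filters before sorting, so a cluster with no quota'ed roles pays only for the tree walk and not for sorting every role. A real implementation would likely also cache the set of quota'ed roles to avoid the full walk.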