[jira] [Commented] (MESOS-10015) HierarchicalAllocatorProcess::updateAllocation() can stall the allocator with a huge number of reservations on an agent.

Benjamin Mahler (Jira) Wed, 30 Oct 2019 17:00:46 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963527#comment-16963527
 ]


Benjamin Mahler commented on MESOS-10015:
-----------------------------------------

{noformat}
commit 3f753e77a9e00b884000b59df8797e5422da8ccd
Author: Andrei Sekretenko
Date:   Wed Oct 30 19:45:33 2019 -0400

    Fixed allocator performance issue in updateAllocation().

    This patch addresses poor performance of
    `HierarchicalAllocatorProcess::updateAllocation()` for agents with
    a huge number of non-addable resources in a many-framework case
    (see MESOS-10015).

    Sorter methods for totals tracking that modify `Resources` of an agent
    in the Sorter are replaced with methods that add/remove resource
    quantities of an agent as a whole (which was actually the only use case
    of the old methods). Thus, subtracting/adding `Resources` of a whole
    agent no longer occurs when updating resources of an agent in a Sorter.

    Further, this patch completely removes agent resource tracking logic
    from the random sorter (which by itself makes no use of them) by
    implementing cluster totals tracking in the allocator.

    Results of `*BENCHMARK_WithReservationParam.UpdateAllocation*`
    (for the DRF sorter):

    Master:
    Agent resources size: 200 (50 frameworks)
    Made 20 reserve and unreserve operations in 2.08586secs
    Agent resources size: 400 (100 frameworks)
    Made 20 reserve and unreserve operations in 13.8449005secs
    Agent resources size: 800 (200 frameworks)
    Made 20 reserve and unreserve operations in 2.19253121188333mins

    Master + this patch:
    Agent resources size: 200 (50 frameworks)
    Made 20 reserve and unreserve operations in 468.482366ms
    Agent resources size: 400 (100 frameworks)
    Made 20 reserve and unreserve operations in 925.725947ms
    Agent resources size: 800 (200 frameworks)
    Made 20 reserve and unreserve operations in 2.110337109secs
    ...
    Agent resources size: 6400 (1600 frameworks)
    Made 20 reserve and unreserve operations in 1.50141861756667mins

    Review: https://reviews.apache.org/r/71646/
{noformat}
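
The idea of tracking an agent's resource quantities as a whole, as described in the commit message above, might look roughly like the following minimal sketch. It is not the code from the patch: `ToyResource`, `ToyQuantities`, `quantitiesOf`, `addAgent` and `removeAgent` are hypothetical names invented for illustration, and the toy model assumes that only scalar amounts per resource name matter for the totals.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a Mesos `Resource` object (not the real type).
struct ToyResource {
  std::string name;         // e.g. "cpus"
  std::string reservation;  // empty means unreserved
  double amount;
};

// Hypothetical per-name scalar totals, e.g. {"cpus": 6, "mem": 1024}.
using ToyQuantities = std::map<std::string, double>;

// Collapses an agent's resource objects into per-name quantities.
// Reservation metadata is deliberately dropped.
ToyQuantities quantitiesOf(const std::vector<ToyResource>& resources) {
  ToyQuantities result;
  for (const ToyResource& r : resources) {
    result[r.name] += r.amount;
  }
  return result;
}

// Adds one agent's quantities to the tracked totals: one map update per
// resource name, with no pairwise comparison of resource objects.
void addAgent(ToyQuantities& totals, const ToyQuantities& agent) {
  for (const auto& entry : agent) {
    totals[entry.first] += entry.second;
  }
}

// Removes one agent's quantities from the tracked totals.
void removeAgent(ToyQuantities& totals, const ToyQuantities& agent) {
  for (const auto& entry : agent) {
    auto it = totals.find(entry.first);
    assert(it != totals.end());  // the agent must have been added before
    it->second -= entry.second;
  }
}
```

In this toy model, reserving 2 of an agent's 6 cpus changes the list of resource objects but leaves `quantitiesOf()` unchanged, so the cost of updating the totals depends on the number of distinct resource names and not on the number of reservations.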

> HierarchicalAllocatorProcess::updateAllocation() can stall the allocator with 
> a huge number of reservations on an agent.
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10015
>                 URL: https://issues.apache.org/jira/browse/MESOS-10015
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.3, 1.6.2, 1.7.2, 1.8.1, 1.9.0
>            Reporter: Andrei Sekretenko
>            Assignee: Andrei Sekretenko
>            Priority: Critical
>              Labels: resource-management
>             Fix For: 1.10.0
>
>         Attachments: out.svg
>
>
> Currently, a call to updateAllocation() for a single-object Resources, for a
> single framework on a single slave, requires `(total number of frameworks) *
> (number of resource objects on this slave)^2` calls of `Resource::addable()`.
> In a cluster with a large number of frameworks, this results in severe
> degradation of allocator performance when a bunch of RESERVE/UNRESERVE
> operations occur for an agent with hundreds of unique resources.
> On our testing cluster, we observed task scheduling delays of up to 30
> minutes due to the allocator being occupied with processing UNRESERVE operations.
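
To make the growth rate in the description concrete, here is a toy model that counts addability checks. It is only a sketch under stated assumptions and not the Mesos implementation: `ToyResource`, `addable`, `add` and `countAddableCalls` are hypothetical, every resource object on the agent is assumed to carry a unique reservation (so no two objects are addable), and each framework is assumed to trigger one rebuild of the agent's resources. Under these assumptions the model performs `frameworks * n * (n - 1) / 2` checks for `n` resource objects, which has the same order of growth as the formula in the description but is not the exact count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Mesos `Resource` object (not the real type).
struct ToyResource {
  std::string name;
  std::string reservation;
  double amount;
};

// Toy addability check that also counts how often it is invoked.
bool addable(const ToyResource& a, const ToyResource& b, std::size_t& calls) {
  ++calls;
  return a.name == b.name && a.reservation == b.reservation;
}

// Adds `r` to `total`, merging it into an addable entry if one exists.
// With no addable entry, this scans the whole collection before appending.
void add(std::vector<ToyResource>& total,
         const ToyResource& r,
         std::size_t& calls) {
  for (ToyResource& existing : total) {
    if (addable(existing, r, calls)) {
      existing.amount += r.amount;
      return;
    }
  }
  total.push_back(r);
}

// Counts addability checks when the agent's resources are re-added once per
// framework (an assumption of this toy model).
std::size_t countAddableCalls(std::size_t frameworks,
                              std::size_t objectsPerAgent) {
  std::vector<ToyResource> agent;
  for (std::size_t i = 0; i < objectsPerAgent; ++i) {
    agent.push_back(
        ToyResource{"cpus", "reservation-" + std::to_string(i), 1.0});
  }

  std::size_t calls = 0;
  for (std::size_t f = 0; f < frameworks; ++f) {
    std::vector<ToyResource> total;
    for (const ToyResource& r : agent) {
      add(total, r, calls);
    }
    // All objects are unique, so nothing was merged.
    assert(total.size() == objectsPerAgent);
  }
  return calls;
}
```

In this model, doubling both the number of frameworks and the number of resource objects multiplies the number of checks by roughly eight: `countAddableCalls(50, 200)` and `countAddableCalls(100, 400)` differ by a factor of about 8.02.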



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
