[jira] [Commented] (MESOS-3157) only perform batch resource allocations

Benjamin Mahler (JIRA) Tue, 28 Jul 2015 10:32:25 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644708#comment-14644708
 ]


Benjamin Mahler commented on MESOS-3157:
----------------------------------------

Please make it controlled by a flag, and by default the event-triggered 
allocations should still be in effect. Note that the lack of event-triggered 
allocations can have a negative effect on small-task, high-throughput 
frameworks (see MESOS-3078 for an example, where we currently are not doing 
event-triggered notifications on resource recovery but would like to).

First though, it is a "bug" that the allocator is getting backlogged. Can we 
address the performance issues to make sure that this is really necessary? 
A couple of suggestions:

# Filters: filters are not keyed by SlaveID, so every time we check isFiltered 
we have to loop through filters across all the slaves. Can we key these 
on SlaveID and only check filters for a particular slave?
# Allocation loop: we continue running through the roles and the frameworks 
after the {{available}} resources on the slave become {{!allocatable}}. Can we 
break out and move on to the next slave at that point?
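
To illustrate the first suggestion, here is a minimal sketch of filters keyed by slave. It is not the allocator's actual code: {{FrameworkFilters}}, {{Filter}}, {{Resources}} and the string-based {{SlaveID}} are hypothetical stand-ins, and the "refuse anything no larger than what was declined" rule is an assumption made only to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the allocator's types; not the Mesos API.
using SlaveID = std::string;

struct Resources {
  double cpus;
  double mem;
};

// Assumed filter rule: hide offers that are no larger than what was declined.
struct Filter {
  Resources refused;

  bool filters(const Resources& offered) const {
    return offered.cpus <= refused.cpus && offered.mem <= refused.mem;
  }
};

// Per-framework filters, indexed by the slave they apply to.
class FrameworkFilters {
public:
  void add(const SlaveID& slaveId, const Filter& filter) {
    assert(!slaveId.empty());
    bySlave_[slaveId].push_back(filter);
  }

  // Only the filters registered for `slaveId` are examined, so the cost
  // no longer grows with the number of slaves in the cluster.
  bool isFiltered(const SlaveID& slaveId, const Resources& offered) const {
    auto it = bySlave_.find(slaveId);
    if (it == bySlave_.end()) {
      return false;
    }
    for (const Filter& filter : it->second) {
      if (filter.filters(offered)) {
        return true;
      }
    }
    return false;
  }

  // Drop every filter for one slave, e.g. when the slave is removed.
  void clear(const SlaveID& slaveId) { bySlave_.erase(slaveId); }

private:
  std::unordered_map<SlaveID, std::vector<Filter>> bySlave_;
};
```

With this layout a check for one slave touches only that slave's bucket, and removing a slave is a single erase rather than a scan.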

Let's see how far these two optimizations get us. Also, the current loop 
re-computes the sortings once for each slave; changing that is a bit more 
involved, but the re-computation doesn't seem necessary.
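
For the allocation-loop suggestion (stop visiting roles and frameworks once a slave has nothing allocatable left), a rough sketch could look like the following. All types here are hypothetical, the thresholds in {{allocatable}} are assumed values, and the "first unfiltered framework gets everything available" rule is a simplification; the roles and frameworks are assumed to be already sorted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified types; not the Mesos allocator's own.
struct Resources { double cpus; double mem; };
struct Slave { std::string id; Resources available; };
struct Framework { std::string id; bool filtered; };
struct Role { std::string name; std::vector<Framework> frameworks; };
struct Offer { std::string frameworkId; std::string slaveId; Resources resources; };

struct Result {
  std::vector<Offer> offers;
  int checks = 0;  // number of (slave, framework) pairs actually examined
};

// Assumed thresholds below which resources are not worth offering.
bool allocatable(const Resources& r) {
  return r.cpus >= 0.01 || r.mem >= 32.0;
}

// Visit roles and frameworks in their (already sorted) order for one slave.
void allocateSlave(Slave& slave, const std::vector<Role>& roles, Result& result) {
  assert(slave.available.cpus >= 0.0 && slave.available.mem >= 0.0);
  for (const Role& role : roles) {
    for (const Framework& framework : role.frameworks) {
      // Early exit: nothing left on this slave, so skip the remaining
      // roles and frameworks and let the caller move to the next slave.
      if (!allocatable(slave.available)) {
        return;
      }
      ++result.checks;
      if (framework.filtered) {
        continue;
      }
      // Simplification: the framework receives everything available.
      result.offers.push_back(Offer{framework.id, slave.id, slave.available});
      slave.available = Resources{0.0, 0.0};
    }
  }
}

Result allocate(std::vector<Slave>& slaves, const std::vector<Role>& roles) {
  Result result;
  for (Slave& slave : slaves) {
    allocateSlave(slave, roles, result);
  }
  return result;
}
```

Putting the per-slave work in its own function lets a single {{return}} leave both the role loop and the framework loop at once.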

> only perform batch resource allocations
> ---------------------------------------
>
>                 Key: MESOS-3157
>                 URL: https://issues.apache.org/jira/browse/MESOS-3157
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: James Peach
>            Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (e.g. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
