[
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644708#comment-14644708
]
Benjamin Mahler commented on MESOS-3157:
----------------------------------------
Please make it controlled by a flag, and by default the event-triggered
allocations should still be in effect. Note that the lack of event-triggered
allocations can have a negative effect on small-task, high-throughput
frameworks (see MESOS-3078 for an example, where we currently are not doing
event-triggered notifications on resource recovery but would like to).
First, though: it is a "bug" that the allocator is getting backlogged. Can we
address the performance issues to make sure that this is really necessary?
Couple of suggestions:
# Filters: filters are not keyed by SlaveID, so we have to loop through
filters across all the slaves each time we check isFiltered. Can we key these
on SlaveID and only check filters for a particular slave?
# Allocation loop: we continue running through the roles and the frameworks
after the {{available}} resources on the slave become {{!allocatable}}. Can we
break out and move on to the next slave at that point?
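
A minimal sketch of the filter suggestion, assuming simplified stand-in types
rather than the real allocator code: {{FilterIndex}}, {{Filter}} and the
CPU-only matching rule are hypothetical and exist only for illustration. The
point is that {{isFiltered}} examines just the filters recorded for the one
slave being considered.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; these are not the real Mesos types.
using SlaveID = std::string;
using FrameworkID = std::string;

// Illustrative filter: refuses offers with at most `cpus` CPUs.
struct Filter {
  double cpus;

  bool filters(double offeredCpus) const { return offeredCpus <= cpus; }
};

class FilterIndex {
public:
  void add(const FrameworkID& framework, const SlaveID& slave, Filter filter) {
    filters_[framework][slave].push_back(filter);
  }

  // Only the filters stored under `slave` are examined, instead of
  // every filter the framework holds across all slaves.
  bool isFiltered(const FrameworkID& framework,
                  const SlaveID& slave,
                  double offeredCpus) const {
    auto byFramework = filters_.find(framework);
    if (byFramework == filters_.end()) {
      return false;
    }

    auto bySlave = byFramework->second.find(slave);
    if (bySlave == byFramework->second.end()) {
      return false;
    }

    for (const Filter& filter : bySlave->second) {
      if (filter.filters(offeredCpus)) {
        return true;
      }
    }
    return false;
  }

private:
  // framework -> slave -> filters installed for that pair.
  std::unordered_map<FrameworkID,
                     std::unordered_map<SlaveID, std::vector<Filter>>>
      filters_;
};
```

With this layout the cost of a check depends on the number of filters for one
(framework, slave) pair, not on the number of slaves.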
Let's see how far these two optimizations get us. Also, the current loop
re-computes the sortings once for each slave. That is a bit more involved to
change, but the re-computation doesn't seem necessary.
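
The early break-out from the allocation loop could look roughly like the
sketch below. Everything in it is an assumption made for illustration: the
{{Resources}}, {{Slave}}, {{Framework}} and {{Role}} structs, the thresholds
inside {{allocatable}}, and the rule that a framework simply takes its demand
when it fits. It is not the actual allocator code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Resources {
  double cpus;
  double mem;
};

struct Slave {
  std::string id;
  Resources available;
};

struct Framework {
  std::string id;
  Resources demand;
};

struct Role {
  std::string name;
  std::vector<Framework> frameworks;
};

// Assumed thresholds: below both of them nothing useful can be offered.
bool allocatable(const Resources& r) {
  return r.cpus >= 0.01 || r.mem >= 32.0;
}

// Returns how many (role, framework) pairs were visited, so the effect
// of breaking out early can be observed.
std::size_t allocate(std::vector<Slave>& slaves,
                     const std::vector<Role>& roles) {
  std::size_t visited = 0;

  for (Slave& slave : slaves) {
    // Skip slaves that have nothing allocatable to begin with.
    bool exhausted = !allocatable(slave.available);

    for (auto role = roles.begin(); role != roles.end() && !exhausted;
         ++role) {
      for (const Framework& framework : role->frameworks) {
        ++visited;

        // Simplification: a framework takes its demand if it fits.
        if (framework.demand.cpus <= slave.available.cpus &&
            framework.demand.mem <= slave.available.mem) {
          slave.available.cpus -= framework.demand.cpus;
          slave.available.mem -= framework.demand.mem;
        }

        // Stop walking roles and frameworks once the slave has
        // nothing allocatable left; move on to the next slave.
        if (!allocatable(slave.available)) {
          exhausted = true;
          break;
        }
      }
    }
  }

  return visited;
}
```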
> only perform batch resource allocations
> ---------------------------------------
>
> Key: MESOS-3157
> URL: https://issues.apache.org/jira/browse/MESOS-3157
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Reporter: James Peach
> Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived
> frameworks that often revive offers. Running the allocator takes a long time
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the
> allocator process to get very long, and the allocator effectively becomes
> unresponsive (e.g. a revive offers message takes too long to come to the head
> of the queue).
> We have been running a patch to remove all the event-triggered allocations
> and only allocate from the batch task
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves
> responsiveness.