[ https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069346#comment-15069346 ]
Qian Zhang commented on MESOS-3157:
-----------------------------------

[~jamespeach], from the description of this ticket, I understand the issue is that the allocator becomes unresponsive because of its long event queue, so that, for example, a reviveOffers call takes a long time to be handled by the allocator. I took a look at the example changes you posted: {{HierarchicalAllocatorProcess::reviveOffers()}} has been updated to dispatch an allocation rather than perform one synchronously. That means that even when a reviveOffers is finally handled by the allocator after a long wait, it will not take effect immediately (i.e., it will not trigger an allocation so that the framework gets offers right away). As in the first example in your comment above, the allocations triggered by the two reviveOffers calls will not be executed, and the frameworks have to wait until the allocation triggered by removeQuota is executed before they get offers. Will this actually make things worse?

In the description of this ticket, I also see:
{quote}
Running the allocator takes a long time (from seconds up to minutes)
{quote}
Did you mean that the execution of {{HierarchicalAllocatorProcess::allocate(const hashset<SlaveID>& slaveIds_)}} takes a long time? If so, I'd like to know why it takes so long (I think it is just some in-memory computation). Is it because there are too many slaves in your Mesos cluster, or too many frameworks?

> only perform batch resource allocations
> ---------------------------------------
>
>                 Key: MESOS-3157
>                 URL: https://issues.apache.org/jira/browse/MESOS-3157
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: James Peach
>            Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived
> frameworks that often revive offers. Running the allocator takes a long time
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the
> allocator process to get very long, and the allocator effectively becomes
> unresponsive (eg. a revive offers message takes too long to come to the head
> of the queue).
> We have been running a patch to remove all the event-triggered allocations
> and only allocate from the batch task
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves
> responsiveness.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
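
To make the trade-off discussed in this thread concrete, here is a minimal sketch of a single-threaded event queue that either runs an allocation inside every revive handler (event-triggered) or queues at most one pending allocation (dispatched). It is a toy model under stated assumptions, not the Mesos allocator or the posted patch: {{ToyAllocator}}, {{sendRevive}}, {{drain}} and {{countAllocations}} are hypothetical names, and the "allocation" is just a counter.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical toy model of a single-threaded actor with an event queue.
// It does not use libprocess or any real Mesos API.
class ToyAllocator {
public:
  explicit ToyAllocator(bool coalesce) : coalesce_(coalesce) {}

  // Enqueue a revive event, as a framework would by sending a message.
  void sendRevive() {
    queue_.push_back([this] { handleRevive(); });
  }

  // Process queued events in FIFO order until the queue is empty.
  void drain() {
    while (!queue_.empty()) {
      std::function<void()> event = std::move(queue_.front());
      queue_.pop_front();
      event();  // May push further events onto the queue.
    }
  }

  int allocationsRun() const { return allocationsRun_; }

private:
  void handleRevive() {
    if (!coalesce_) {
      // Event-triggered: the expensive pass runs inside the handler,
      // so every queued revive pays for a full allocation.
      allocate();
      return;
    }
    // Dispatched: queue one allocation unless one is already pending.
    // It runs only after the events that are ahead of it in the queue.
    if (!allocationPending_) {
      allocationPending_ = true;
      queue_.push_back([this] {
        allocationPending_ = false;
        allocate();
      });
    }
  }

  // Stand-in for the expensive allocation pass.
  void allocate() { ++allocationsRun_; }

  bool coalesce_;
  bool allocationPending_ = false;
  int allocationsRun_ = 0;
  std::deque<std::function<void()>> queue_;
};

// Queue `revives` revive events, drain the queue, and report how many
// allocation passes were executed.
int countAllocations(bool coalesce, int revives) {
  ToyAllocator allocator(coalesce);
  for (int i = 0; i < revives; ++i) {
    allocator.sendRevive();
  }
  allocator.drain();
  return allocator.allocationsRun();
}
```

In this model, three queued revives cost three allocation passes in event-triggered mode and one pass in dispatched mode. The single pass runs only after every event already in the queue has been handled, which is the delay the question above is asking about.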