[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069346#comment-15069346
 ] 

Qian Zhang commented on MESOS-3157:
-----------------------------------

[~jamespeach], from the description of this ticket, I understand the issue to 
be that the allocator becomes unresponsive because of its long event queue, so 
that e.g. a reviveOffers call takes a long time to be handled. I took a look at 
the example changes you posted: 
{{HierarchicalAllocatorProcess::reviveOffers()}} now dispatches an allocation 
rather than performing one synchronously. That means that even once a 
reviveOffers is finally handled by the allocator, it will not take effect 
immediately (i.e., it will not trigger an allocation so that the framework gets 
offers right away). As in the first example in your comment above, the 
allocations triggered by the two reviveOffers will not be executed, and the 
frameworks have to wait for the allocation triggered by removeQuota before they 
get offers. Won't this actually make things worse?
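
To make the concern concrete, here is a toy model of a single-threaded event 
queue in which handlers request an allocation instead of running one inline. It 
is not Mesos code: {{ToyAllocator}}, {{requestAllocation}}, {{simulate}} and the 
"coalesce into one pending allocation" rule are all assumptions made for 
illustration, and the actual patch may coalesce differently.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model of a single-threaded actor: events run strictly in FIFO order.
// This is NOT Mesos code; every name here is hypothetical.
class ToyAllocator {
public:
  // Public calls only enqueue an event, roughly as a dispatch would.
  void reviveOffers(const std::string& framework) {
    queue.push_back([this, framework] {
      log.push_back("revive " + framework);
      requestAllocation();
    });
  }

  void removeQuota(const std::string& role) {
    queue.push_back([this, role] {
      log.push_back("removeQuota " + role);
      requestAllocation();
    });
  }

  // Drain the queue and return the order in which events were handled.
  std::vector<std::string> run() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
    return log;
  }

private:
  // Assumed coalescing rule: enqueue an allocation only if none is pending.
  void requestAllocation() {
    if (allocationPending) {
      return;
    }
    allocationPending = true;
    queue.push_back([this] { allocate(); });
  }

  void allocate() {
    assert(allocationPending);
    allocationPending = false;
    log.push_back("allocate");
  }

  std::deque<std::function<void()>> queue;
  std::vector<std::string> log;
  bool allocationPending = false;
};

// Two revives and a quota removal are already queued when the actor runs.
std::vector<std::string> simulate() {
  ToyAllocator allocator;
  allocator.reviveOffers("A");
  allocator.reviveOffers("B");
  allocator.removeQuota("r");
  return allocator.run();
}
```

In this model, {{simulate()}} handles both revives and the quota removal before 
the single allocation runs, so a framework that revived first still waits for 
everything queued ahead of that allocation.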

And in the description of this ticket, I also see:
{quote} Running the allocator takes a long time (from seconds up to minutes) 
{quote}
Did you mean that executing {{HierarchicalAllocatorProcess::allocate(const 
hashset<SlaveID>& slaveIds_)}} takes a long time? If so, I'd like to know why 
it takes so long (I think it is just in-memory computation). Is it because 
there are too many slaves in your Mesos cluster, or too many frameworks?

> only perform batch resource allocations
> ---------------------------------------
>
>                 Key: MESOS-3157
>                 URL: https://issues.apache.org/jira/browse/MESOS-3157
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: James Peach
>            Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (e.g. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
