[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069858#comment-15069858
 ] 

James Peach commented on MESOS-3157:
------------------------------------

{quote}
allocator becomes unresponsive due to its long event queue
{quote}

The problem is not the length of the queue but the number of long-running 
events in it. For example, if an allocation pass takes 3 seconds and we queue 
one every second, the queue will grow without bound. The rate at which 
allocation requests arrive is proportional to the amount of churn in the cluster.
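
To make the arithmetic above concrete, here is a minimal sketch of that queueing model. It is not Mesos code: the function name {{backlogAfter}} and its parameters are hypothetical, and it assumes a single worker that handles one allocation pass at a time, with time advancing in whole seconds.

```cpp
#include <cstdint>

// Hypothetical model, not part of Mesos. One allocation request is enqueued
// every `arrivalIntervalSecs` seconds and each allocation pass keeps the
// single worker busy for `passDurationSecs` seconds. Returns the number of
// requests that are still queued or in progress after `elapsedSecs` seconds.
std::int64_t backlogAfter(
    std::int64_t passDurationSecs,
    std::int64_t arrivalIntervalSecs,
    std::int64_t elapsedSecs)
{
  std::int64_t queued = 0;     // Requests waiting, including the running one.
  std::int64_t busyUntil = 0;  // Time at which the running pass finishes.
  bool running = false;

  for (std::int64_t t = 1; t <= elapsedSecs; ++t) {
    // Retire the running pass once its duration has elapsed.
    if (running && t >= busyUntil) {
      running = false;
      --queued;
    }

    // Enqueue a new request on every arrival tick.
    if (t % arrivalIntervalSecs == 0) {
      ++queued;
    }

    // Start the next pass if the worker is idle and work is pending.
    if (!running && queued > 0) {
      running = true;
      busyUntil = t + passDurationSecs;
    }
  }

  return queued;
}
```

With a 3 second pass and one request per second, three requests arrive for every one that completes, so under these assumptions the backlog grows by about two requests every three seconds for as long as the churn continues.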

{quote}
is it because there are too many slaves in your Mesos cluster? Or too many 
frameworks?
{quote}

Yes, we run fairly large clusters with numerous frameworks (hundreds). See the 
{{HierarchicalAllocator_BENCHMARK_Test.DeclineOffers}} test for a synthetic 
example.

{quote}
means even when a reviveOffers is handled by allocator after a long time, it 
will not take effect immediately (i.e., trigger an allocation so that framework 
can get offers immediately)
{quote}

Yes, that is correct. When a number of frameworks revive at once, we only want 
to do a single allocation pass across all the slaves, not multiple passes. 
This necessarily entails some sort of batching or delay, though the delay is 
bounded by the allocation interval.
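
The batching idea can be sketched roughly as follows. This is an illustration only, not the actual patch or the Mesos allocator API: the class {{CoalescingAllocator}} and its methods are hypothetical, and the sketch assumes some timer calls {{batch()}} once per allocation interval.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical sketch, not the Mesos allocator. Revive requests are only
// recorded when they arrive; the allocation itself happens in batch().
class CoalescingAllocator
{
public:
  // Records that a framework wants offers again. No allocation pass is
  // triggered here, so handling the event is cheap.
  void reviveOffers(const std::string& frameworkId)
  {
    revived.insert(frameworkId);
  }

  // Assumed to be called once per allocation interval. Performs exactly one
  // pass no matter how many revives were recorded since the previous call,
  // and returns the number of distinct frameworks whose revive it covered.
  std::size_t batch()
  {
    const std::size_t covered = revived.size();
    ++passes;  // Stand-in for a single allocation pass across all slaves.
    revived.clear();
    return covered;
  }

  int passCount() const { return passes; }

private:
  std::set<std::string> revived;  // Frameworks revived since the last pass.
  int passes = 0;                 // Number of allocation passes performed.
};
```

In this sketch a framework that revives just after a pass waits at most one allocation interval for the next one, which is the bounded delay described above.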

As I pointed out earlier in this ticket, I haven't been able to create a 
benchmark that demonstrates the original problem. I'm working on deploying an 
unpatched Mesos to one of our test clusters to better understand the 
triggering conditions.

> only perform batch resource allocations
> ---------------------------------------
>
>                 Key: MESOS-3157
>                 URL: https://issues.apache.org/jira/browse/MESOS-3157
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: James Peach
>            Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (e.g. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
