[ 
https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094366#comment-15094366
 ] 

Alexander Rukletsov commented on MESOS-4302:
--------------------------------------------

Let me elaborate a bit on the issue and possible workarounds.

First off, the described situation (the filter is technically never 
applied) may happen even when the allocator is neither slow nor backlogged, 
for example if the timeout is set to {{10s}} and the allocation interval is 
{{100s}}. Moreover, a third-party allocator can perform allocations in an 
arbitrary manner.
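
To make the timing concrete, here is a minimal sketch of the situation. It 
is a toy model, not Mesos code: the function name and its parameters are 
made up for illustration, and it assumes the filter expires at a fixed wall 
clock time counted from the moment the decline is processed.

```python
def allocations_blocked_by_filter(decline_time, filter_timeout, allocation_times):
    """Return the allocation times at which the filter is still active.

    Toy model (an assumption, not the Mesos implementation): the filter is
    created at `decline_time` and expires `filter_timeout` seconds later,
    regardless of whether any allocation ran in between.
    """
    expiry = decline_time + filter_timeout
    return [t for t in allocation_times if decline_time <= t < expiry]


# The framework declines at t=0; allocations run every 100s.
allocation_times = [100, 200, 300]

# A 10s filter has already expired by the first allocation, so it blocks nothing.
short_filter = allocations_blocked_by_filter(0, 10, allocation_times)

# A 150s filter is still active at t=100, so it blocks that one allocation.
long_filter = allocations_blocked_by_filter(0, 150, allocation_times)

print(short_filter, long_filter)
```

With a {{10s}} timeout and a {{100s}} interval the filter never coincides 
with an allocation, even though nothing is slow or backlogged.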

However, there is a real problem: idle low-share frameworks may block resources 
in "offer-decline" cycles. [~jvanremoortere] nicely summarized the issue in one 
sentence: "It's a shame if the 'default' (5s filter) doesn't co-operate well as 
your cluster scales". We have to fix it.

I would argue that the "right" solution to this problem is a combination of 
quota and suppressing offers. But quota is neither mandatory nor available 
before 0.27.0 (while the fix can be easily backported). Currently we are 
leaning towards a patch with a small footprint that fixes the transactionality 
of the offer timeout, and cherry-picking it into Mesos versions prior to 0.27.0.
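
One way to picture such a fix is a filter that is guaranteed to take effect 
for at least one allocation. The sketch below is only an illustration of 
that idea under my own assumptions; the class and method names are 
hypothetical and this is not the actual patch.

```python
class AllocationAwareFilter:
    """Hypothetical filter that is applied to at least one allocation.

    Assumption for illustration: the allocator calls `blocks(now)` exactly
    once per allocation cycle for the agent this filter belongs to.
    """

    def __init__(self, created_at, timeout):
        self.expiry = created_at + timeout
        self.allocations_seen = 0

    def blocks(self, now):
        """Return True if the offer should be withheld in this allocation."""
        self.allocations_seen += 1
        if now < self.expiry:
            # The timeout has not elapsed yet: behave like a plain filter.
            return True
        # The timeout has elapsed. Still block the first allocation that
        # consults the filter, so the decline has an effect at least once.
        return self.allocations_seen == 1


# 10s filter, allocations every 100s: the first allocation is still blocked.
f = AllocationAwareFilter(created_at=0, timeout=10)
first = f.blocks(now=100)
second = f.blocks(now=200)
print(first, second)
```

In this model the declined resources skip the declining framework for one 
allocation cycle, which gives other frameworks a chance to receive them.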

> Offer filter timeouts are ignored if the allocator is slow or backlogged.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-4302
>                 URL: https://issues.apache.org/jira/browse/MESOS-4302
>             Project: Mesos
>          Issue Type: Improvement
>          Components: allocation
>            Reporter: Benjamin Mahler
>            Assignee: Alexander Rukletsov
>            Priority: Critical
>              Labels: mesosphere
>
> Currently, when the allocator recovers resources from an offer, it creates a 
> filter timeout based on the time at which the call is processed.
> This means that if it takes longer than the filter duration for the allocator 
> to perform an allocation for the relevant agent, then the filter is never 
> applied.
> This leads to pathological behavior: if the framework sets a filter duration 
> that is smaller than the wall clock time it takes for us to perform the next 
> allocation, then the filters will have no effect. This can mean that low 
> share frameworks may continue receiving offers that they have no intent to 
> use, without other frameworks ever receiving these offers.
> The workaround for this is for frameworks to set high filter durations, and 
> possibly to revive offers when they need more resources. However, we should 
> fix this issue in the allocator (i.e. derive the timeout deadlines and 
> expiry based on allocation times).
> This seems to warrant cherry-picking into bug fix releases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
