[jira] [Resolved] (FLINK-14029) Update Flink's Mesos scheduling behavior to reject all expired offers

Till Rohrmann (Jira) Mon, 16 Sep 2019 04:59:13 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann resolved FLINK-14029.
-----------------------------------
    Fix Version/s: 1.10.0
     Release Note: Flink's Mesos integration now rejects all expired offers 
instead of only 4 at a time. This improves the situation where Fenzo holds on 
to a lot of expired offers without giving them back to the Mesos resource 
manager.
       Resolution: Fixed

Fixed via e6f87d33ae891dce89463868500f91c3fe01265c

> Update Flink's Mesos scheduling behavior to reject all expired offers
> ---------------------------------------------------------------------
>
>                 Key: FLINK-14029
>                 URL: https://issues.apache.org/jira/browse/FLINK-14029
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While digging into why our Flink jobs weren't being scheduled on our internal 
> Mesos setup, we noticed that we were hitting Mesos quota limits tied to the 
> way the Fenzo (https://github.com/Netflix/Fenzo/) library defaults are set up 
> in the Flink project. 
> We received a batch of offers from our Mesos master (50+), of which 1 or 2 
> were heavily skewed and took up a large share of our disk resource quota. 
> Because of this, we were not sent any new or different offers (our usage at 
> the time plus the outstanding resource offers reached our Mesos disk quota). 
> As the Flink / Fenzo Mesos scheduling code was not using the 1-2 skewed disk 
> offers, they ended up expiring. The Fenzo scheduler is set up to use the 
> default values for when to expire unused offers (120s) and for the maximum 
> number of unused offer leases rejected at a time (4). Unfortunately, as we 
> have a considerable number of outstanding expired offers (50+), we reject 
> only 4 or so every 2 minutes and never get around to rejecting the heavily 
> skewed disk offers that are stopping us from scheduling our Flink job. As a 
> result, our job waits to be scheduled for more than an hour. 
> One way to work around this is to reject all expired offers at the 2-minute 
> expiry time rather than hold on to them. This allows Mesos to send 
> alternative offers that Fenzo might be able to schedule. 
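
To make the arithmetic in the description concrete, here is a toy model of the
rejection backlog. It is only a sketch under stated assumptions, not Fenzo's or
Flink's actual code: it assumes expired offers are rejected in a fixed order,
that one batch is rejected per 120s expiry interval, and that no new offers
arrive in the meantime. The function names and parameters are hypothetical.

```python
import math


def rounds_to_reject(position, max_rejects_per_round):
    """Number of rejection rounds before the offer at 1-based `position`
    in the (assumed) rejection order is handed back to Mesos."""
    return math.ceil(position / max_rejects_per_round)


def wait_seconds(position, max_rejects_per_round, expiry_secs=120):
    """Time until that offer is rejected, assuming one round per expiry interval."""
    return rounds_to_reject(position, max_rejects_per_round) * expiry_secs


# Worst case from the report: a skewed offer is the last of 50 expired offers.
capped = wait_seconds(position=50, max_rejects_per_round=4)     # cap of 4 per round
uncapped = wait_seconds(position=50, max_rejects_per_round=50)  # reject all expired

print(capped, uncapped)  # 1560 120
```

Under these assumptions the capped case needs 13 rounds (26 minutes) to reach
the last offer, while rejecting all expired offers returns it after a single
expiry interval. The real delay depends on the order in which Fenzo rejects
offers and on how quickly returned resources are offered again, so the longer
wait reported above is not contradicted by this simplified model.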



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
