[
https://issues.apache.org/jira/browse/FLINK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann resolved FLINK-14029.
-----------------------------------
Fix Version/s: 1.10.0
Release Note: Flink's Mesos integration now rejects all expired offers
instead of only 4. This improves the situation where Fenzo holds on a lot of
expired offers without giving them back to the Mesos resource manager.
Resolution: Fixed
Fixed via e6f87d33ae891dce89463868500f91c3fe01265c
> Update Flink's Mesos scheduling behavior to reject all expired offers
> ---------------------------------------------------------------------
>
> Key: FLINK-14029
> URL: https://issues.apache.org/jira/browse/FLINK-14029
> Project: Flink
> Issue Type: Bug
> Reporter: Piyush Narang
> Assignee: Piyush Narang
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.10.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While digging into why our Flink jobs weren't being scheduled on our internal
> Mesos setup we noticed that we were hitting Mesos quota limits tied to the
> way we've set up the Fenzo (https://github.com/Netflix/Fenzo/) library
> defaults in the Flink project.
> Behavior we noticed was that we got a bunch of offers from our Mesos master
> (50+) out of which only 1 or 2 of them were super skewed and took up a huge
> chunk of our disk resource quota. Thanks to this we were not sent any new /
> different offers (as our usage at the time + resource offers reached our
> Mesos disk quota). As the Flink / Fenzo Mesos scheduling code was not using
> the 1-2 skewed disk offers they end up expiring. The way we've set up the
> Fenzo scheduler is to use the default values on when to expire unused offers
> (120s) and maximum number of unused offer leases at a time (4). Unfortunately
> as we have a considerable number of outstanding expired offers (50+) we end
> up in a situation where we reject only 4 or so every 2 mins and we never get
> around to rejecting the super skewed disk ones which are stopping us from
> scheduling our Flink job. Thanks to this we end up in a situation where our
> job is waiting to be scheduled for more than an hour.
> An option to work around this is to reject all expired offers at 2 minute
> expiry time rather than hold on to them. This will allow Mesos to send
> alternate offers that might be scheduled by Fenzo.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)