Re: MESOS-4694

Dario Rexin Thu, 07 Jul 2016 11:08:11 -0700

Hi Joris,

that’s great news, thanks! I will add a comment and ping you later.


--
 Dario

> On Jul 7, 2016, at 10:57 AM, Joris Van Remoortere <[email protected]> wrote:
> 
> After syncing with Vinod, we're ok adding this change in the interim. We do 
> want a clear comment in the implementation of suppress explaining that this 
> is a special case and that we will need separate handling if this call 
> becomes parameterized in the future.
> 
> Let me know (ping in mesos slack?) when you feel the patch includes a 
> sufficient explanation, and I'll schedule time to review it.
> 
> Joris
> 
> — 
> Joris Van Remoortere
> Mesosphere
> 
> On Thu, Jul 7, 2016 at 7:20 PM, Dario Rexin <[email protected]> wrote:
> A bit more context:
> 
> We have a very high number of frameworks on our clusters, in some cases ~6k. 
> The biggest problem is the sort method, which has a complexity of O(f log f) 
> for f frameworks and is called a*r times, where a = number of agents and 
> r = number of roles. So in total we have a complexity of O(a * r * f log f), 
> or O(n^3 log n) if all three grow together. I think reducing f is the most 
> promising optimization here. We have been running this patch in production 
> for quite a while now and have seen huge improvements in general allocation 
> time and also in failover times.
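
The cost argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: one framework sort per (agent, role) pair, each costing roughly f * log2(f) comparisons. The function name `sort_work` and the cluster sizes are hypothetical and are not taken from the Mesos code base.

```python
import math


def sort_work(agents: int, roles: int, frameworks: int) -> float:
    """Toy estimate of comparison work for one allocation cycle.

    Assumes one O(f log f) framework sort per (agent, role) pair.
    This is a cost model, not Mesos code.
    """
    if frameworks < 2:
        return 0.0  # nothing to compare with zero or one framework
    return agents * roles * frameworks * math.log2(frameworks)


# Hypothetical cluster: 2000 agents, 1 role, 6000 registered frameworks.
all_frameworks = sort_work(2000, 1, 6000)
# Same cluster if 90% of the frameworks are suppressed and left out of the sort.
active_only = sort_work(2000, 1, 600)

# Ratio of estimated work with and without the suppressed frameworks.
reduction = all_frameworks / active_only
```

Under this model, leaving suppressed frameworks out of the sort shrinks the work by more than the fraction removed, because the log factor shrinks as well.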
> 
> Also, if we were to add a parameterized version of SUPPRESS, what problems do 
> you see with just differentiating between the two cases?
> 
> Thanks,
> --
>  Dario
> 
> > On Jul 7, 2016, at 8:40 AM, Dario Rexin <[email protected]> wrote:
> >
> > Hi Joris,
> >
> > I still don't really understand why we would parameterize SUPPRESS. To me 
> > that sounds like a case for filters. The idea of SUPPRESS was to completely 
> > stop getting offers.
> >
> > Could you please explain why you think the patch is a hack? To me it just 
> > seems logical to not sort frameworks that don't need to be considered in 
> > the allocator.
> >
> > Thanks,
> > Dario
> >
> >> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <[email protected]> wrote:
> >>
> >> The reason that SUPPRESS doesn't just deactivate is that the intent was
> >> to be able to parameterize this call. At that point the change wouldn't
> >> work without turning this into two cases.
> >>
> >> I have asked to look at what a parameterized suppress would look like and
> >> understand the performance impact of that before we do this.
> >> Have we reached consensus that there's no way to implement a generic
> >> parameterized suppress that is performant?
> >>
> >> There are some refactorings that we had discussed with James, Jacob, and
> >> Ian that seem like lower hanging fruit. After those are made it might be
> >> worth reconsidering whether we need to do this hack.
> >>
> >>
> >>
> >> —
> >> *Joris Van Remoortere*
> >> Mesosphere
> >>
> >>> On Thu, Jul 7, 2016 at 10:15 AM, Guangya Liu <[email protected]> wrote:
> >>>
> >>> Hi Ben and Dario,
> >>>
> >>> The reasons we have the "SUPPRESS" call are as follows:
> >>> 1) It acts as the complement to the current REVIVE call.
> >>> 2) The HTTP API does not have a call to "Deactivate" a framework; we want
> >>> to use "SUPPRESS", "DECLINE" and "DECLINE_INVERSE_OFFERS" to implement the
> >>> equivalent of "DeactivateFrameworkMessage".
> >>>
> >>> You can also refer to https://issues.apache.org/jira/browse/MESOS-3037 for
> >>> details.
> >>>
> >>> So I think that Dario's patch is good: we should remove the framework
> >>> clients from the sorter on "SUPPRESS" and add them back on "REVIVE", so
> >>> that the sorter ignores suppressed frameworks.
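
The suppress/revive behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the Mesos allocator: `ToySorter`, `ToyAllocator` and all method names are hypothetical, and the real implementation is C++ with considerably more state.

```python
class ToySorter:
    """Hypothetical stand-in for the allocator's framework sorter."""

    def __init__(self):
        self.shares = {}     # framework id -> share, for every known framework
        self.active = set()  # framework ids that take part in sorting

    def add(self, framework_id, share=0.0):
        self.shares[framework_id] = share
        self.active.add(framework_id)

    def deactivate(self, framework_id):
        self.active.discard(framework_id)

    def activate(self, framework_id):
        if framework_id in self.shares:
            self.active.add(framework_id)

    def sort(self):
        # Only active frameworks are sorted, so suppressed ones add no cost.
        return sorted(self.active, key=lambda f: (self.shares[f], f))


class ToyAllocator:
    def __init__(self):
        self.sorter = ToySorter()

    def suppress_offers(self, framework_id):
        # Special case: SUPPRESS takes no parameters in this sketch, so the
        # framework can be dropped from sorting entirely. A parameterized
        # SUPPRESS would need separate handling.
        self.sorter.deactivate(framework_id)

    def revive_offers(self, framework_id):
        self.sorter.activate(framework_id)


allocator = ToyAllocator()
for name, share in [("a", 0.5), ("b", 0.1), ("c", 0.3)]:
    allocator.sorter.add(name, share)

allocator.suppress_offers("b")
after_suppress = allocator.sorter.sort()  # "b" is skipped

allocator.revive_offers("b")
after_revive = allocator.sorter.sort()    # "b" is sorted again
```

The framework's share is kept while it is suppressed, so reviving it only has to put it back into the active set.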
> >>>
> >>> @Vinod, any comments on this?
> >>>
> >>> @Ben, regarding your concern that the benchmark results are not easy to
> >>> understand, I have filed a JIRA ticket to track this:
> >>> https://issues.apache.org/jira/browse/MESOS-5800
> >>>
> >>> Thanks,
> >>>
> >>> Guangya
> >>>
> >>>
> >>>
> >>>> On Thu, Jul 7, 2016 at 6:01 AM, Dario Rexin <[email protected]> wrote:
> >>>>
> >>>> Hi Vinod,
> >>>>
> >>>> thanks for your reply. The reason it’s so much faster is because the
> >>>> sorting is a lot faster with fewer frameworks. Looping shouldn’t make a
> >>>> huge difference, as it used to just skip over the deactivated frameworks.
> >>>>
> >>>> I don’t know what effects deactivating the framework in the master would
> >>>> have. The framework is still active and listening for events / sending
> >>>> calls. Could you please elaborate?
> >>>>
> >>>> Thanks,
> >>>> --
> >>>>  Dario
> >>>>
> >>>> On Jul 6, 2016, at 2:56 PM, Benjamin Mahler <[email protected]> wrote:
> >>>>
> >>>> +implementer and shepherd of SUPPRESS
> >>>>
> >>>> Is there any reason we didn't already just "deactivate" frameworks that
> >>>> were suppressing offers? That seems to be the natural implementation,
> >>>> performance aside, because the meaning of "deactivated" is: not being
> >>>> sent any offers. The patch you posted seems to only take this half-way:
> >>>> suppress = deactivation in the allocator, but not in the master.
> >>>>
> >>>> Also, Dario, it's a bit hard to interpret these numbers without reading
> >>>> the benchmark code. My interpretation of these numbers is that this
> >>>> change makes the allocation loop complete more quickly when there are
> >>>> many frameworks that are in the suppressed state, because we have to loop
> >>>> over fewer clients. Is this an accurate interpretation?
> >>>>
> >>>> On Wed, Jul 6, 2016 at 2:08 PM, Dario Rexin <[email protected]> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I would like to revive https://issues.apache.org/jira/browse/MESOS-4694,
> >>>> especially https://reviews.apache.org/r/43666/.
> >>>> We heavily depend on this patch and would love to see it merged. To show
> >>>> the value of this patch, I ran the benchmark from
> >>>> https://reviews.apache.org/r/49616/
> >>>> first on HEAD and then with the aforementioned patch applied. I took some
> >>>> lines out to make it easier to see the changes over time in the patched
> >>>> version and to keep this email shorter ;). I would love to get some
> >>>> feedback and discuss any necessary changes to get this patch merged.
> >>>>
> >>>> Here are the results:
> >>>>
> >>>> Mesos HEAD:
> >>>>
> >>>> Using 2000 agents and 200 frameworks
> >>>> round 0 allocate took 3.064665secs to make 199 offers
> >>>> round 1 allocate took 3.029418secs to make 198 offers
> >>>> round 2 allocate took 3.091427secs to make 197 offers
> >>>> round 3 allocate took 2.955457secs to make 196 offers
> >>>> round 4 allocate took 3.133789secs to make 195 offers
> >>>> [...]
> >>>> round 50 allocate took 3.109859secs to make 149 offers
> >>>> round 51 allocate took 3.062746secs to make 148 offers
> >>>> round 52 allocate took 3.146043secs to make 147 offers
> >>>> round 53 allocate took 3.042948secs to make 146 offers
> >>>> round 54 allocate took 3.097835secs to make 145 offers
> >>>> [...]
> >>>> round 100 allocate took 3.027475secs to make 99 offers
> >>>> round 101 allocate took 3.021641secs to make 98 offers
> >>>> round 102 allocate took 2.9853secs to make 97 offers
> >>>> round 103 allocate took 3.145925secs to make 96 offers
> >>>> round 104 allocate took 2.99094secs to make 95 offers
> >>>> [...]
> >>>> round 150 allocate took 3.080406secs to make 49 offers
> >>>> round 151 allocate took 3.109412secs to make 48 offers
> >>>> round 152 allocate took 2.992129secs to make 47 offers
> >>>> round 153 allocate took 3.405642secs to make 46 offers
> >>>> round 154 allocate took 4.153354secs to make 45 offers
> >>>> [...]
> >>>> round 195 allocate took 3.10015secs to make 4 offers
> >>>> round 196 allocate took 3.029347secs to make 3 offers
> >>>> round 197 allocate took 2.982825secs to make 2 offers
> >>>> round 198 allocate took 2.934595secs to make 1 offers
> >>>> round 199 allocate took 313212us to make 0 offers
> >>>>
> >>>> Mesos HEAD + allocator patch:
> >>>>
> >>>> Using 2000 agents and 200 frameworks
> >>>> round 0 allocate took 3.248205secs to make 199 offers
> >>>> round 1 allocate took 3.170852secs to make 198 offers
> >>>> round 2 allocate took 3.135146secs to make 197 offers
> >>>> round 3 allocate took 3.143857secs to make 196 offers
> >>>> round 4 allocate took 3.127641secs to make 195 offers
> >>>> [...]
> >>>> round 50 allocate took 2.492077secs to make 149 offers
> >>>> round 51 allocate took 2.435054secs to make 148 offers
> >>>> round 52 allocate took 2.472204secs to make 147 offers
> >>>> round 53 allocate took 2.457228secs to make 146 offers
> >>>> round 54 allocate took 2.413916secs to make 145 offers
> >>>> [...]
> >>>> round 100 allocate took 1.645015secs to make 99 offers
> >>>> round 101 allocate took 1.647373secs to make 98 offers
> >>>> round 102 allocate took 1.619147secs to make 97 offers
> >>>> round 103 allocate took 1.625496secs to make 96 offers
> >>>> round 104 allocate took 1.580513secs to make 95 offers
> >>>> [...]
> >>>> round 150 allocate took 1.064716secs to make 49 offers
> >>>> round 151 allocate took 1.065604secs to make 48 offers
> >>>> round 152 allocate took 1.053049secs to make 47 offers
> >>>> round 153 allocate took 1.041333secs to make 46 offers
> >>>> round 154 allocate took 1.0461secs to make 45 offers
> >>>> [...]
> >>>> round 195 allocate took 569640us to make 4 offers
> >>>> round 196 allocate took 562107us to make 3 offers
> >>>> round 197 allocate took 547632us to make 2 offers
> >>>> round 198 allocate took 530765us to make 1 offers
> >>>> round 199 allocate took 24426us to make 0 offers
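
To compare the two listings above round by round, the timings first have to be brought to a common unit, since the benchmark prints both `secs` and `us`. The following is a small sketch of one way to do that; the helper `parse_rounds` is hypothetical, and only three rounds copied from the listings are used as input.

```python
import re

# Matches lines such as "round 198 allocate took 530765us to make 1 offers".
LINE = re.compile(r"round (\d+) allocate took ([\d.]+)(secs|us) to make (\d+) offers")
UNIT_SECONDS = {"secs": 1.0, "us": 1e-6}


def parse_rounds(text):
    """Map round number -> allocation time in seconds."""
    rounds = {}
    for match in LINE.finditer(text):
        number, value, unit, _offers = match.groups()
        rounds[int(number)] = float(value) * UNIT_SECONDS[unit]
    return rounds


head = parse_rounds("""
round 0 allocate took 3.064665secs to make 199 offers
round 100 allocate took 3.027475secs to make 99 offers
round 198 allocate took 2.934595secs to make 1 offers
""")
patched = parse_rounds("""
round 0 allocate took 3.248205secs to make 199 offers
round 100 allocate took 1.645015secs to make 99 offers
round 198 allocate took 530765us to make 1 offers
""")

# Ratio > 1 means the patched build finished that round faster than HEAD.
speedup = {n: head[n] / patched[n] for n in sorted(head)}
```

For these three rounds the ratio starts slightly below 1 at round 0 and grows as more frameworks become suppressed, which matches the trend visible in the listings.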
> >>>>
> >>>> --
> >>>>  Dario
> >>>
> 
>
