Re: Implicit reconcile "pauses" offer stream in large cluster

Benjamin Mahler Wed, 07 Feb 2018 17:05:57 -0800

Following up, did you gather any perf data for this?

On Sat, Dec 30, 2017 at 8:15 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:


> Zhitao, any further updates on this?
>
> Thx
>
> > On Dec 13, 2017, at 1:02 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> >
> > You can check the diff, for example:
> > https://github.com/apache/mesos/compare/1.3.0...1.4.0
> >
> > I didn't notice any changes that look like they would cause this.
> >
> > What do the master logs show during the time frame?
> > Have you profiled what the master and scheduler are doing during this
> > time frame?
> >
> >> On Tue, Dec 12, 2017 at 10:46 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> We have seen some potential problems when trying to upgrade Mesos from
> >> 1.3 to 1.4: when an implicit reconciliation happened for a large
> >> framework (Aurora), the scheduler would not see any offers for several
> >> minutes. Strangely, this does not show up once we revert to 1.3.
> >>
> >> A couple of questions:
> >>
> >> 1) Is there any change between 1.3 and 1.4 which can make this slower?
> >> 2) FWICT from reading the code of implicit reconcile, the Mesos master
> >> sends back status for all active and pending tasks for the framework
> >> (which has 70k+ in our cluster right now) in a batch before yielding to
> >> any other messages. Has anyone thought about supporting some kind of
> >> "pagination", i.e., the master would only send back N status updates,
> >> then delay for S seconds, then send back the next batch of N updates,
> >> until all active tasks are handled? This is pretty much how Aurora
> >> triggers explicit reconcile to Mesos, and we don't see any issue when
> >> processing it this way.
> >>
> >> Thanks!
> >>
> >>
> >> --
> >> Cheers,
> >>
> >> Zhitao Li
> >>
>
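
To illustrate the "pagination" idea from the quoted message (send N status updates, wait S seconds, repeat until done), here is a minimal Python sketch. It is not Mesos or Aurora code: `paced_reconcile`, `send_batch`, and the parameter names are hypothetical, and a real implementation would presumably schedule each batch asynchronously instead of blocking in a sleep.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most n items."""
    if n <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), n):
        yield list(items[start:start + n])


def paced_reconcile(
    task_ids: Sequence[str],
    send_batch: Callable[[List[str]], None],  # hypothetical callback
    n: int,
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send task_ids in batches of n, pausing delay_s between batches.

    Returns the number of batches sent. The pause happens only between
    batches, so other work can be interleaved during each gap.
    """
    sent = 0
    for batch in batches(task_ids, n):
        if sent > 0:
            sleep(delay_s)
        send_batch(batch)
        sent += 1
    return sent
```

With 70k tasks, N = 1000 and S = 1 would spread the updates over roughly 70 seconds in 70 batches, rather than sending them all in one uninterrupted run.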
