Following up, did you gather any perf data for this?

On Sat, Dec 30, 2017 at 8:15 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
> Zhitao, any further updates on this?
>
> Thx
>
> > On Dec 13, 2017, at 1:02 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> >
> > You can check the diff, for example:
> > https://github.com/apache/mesos/compare/1.3.0...1.4.0
> >
> > I didn't notice any changes that look like they would cause this.
> >
> > What do the master logs show during the time frame?
> > Have you profiled what the master and scheduler are doing during this
> > time frame?
> >
> >> On Tue, Dec 12, 2017 at 10:46 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> We have seen some potential problems when trying to upgrade Mesos from
> >> 1.3 to 1.4: when an implicit reconciliation happened for a large
> >> framework (Aurora), the scheduler would not see any offers for several
> >> minutes. Strangely, this does not show up once we revert to 1.3.
> >>
> >> A couple of questions:
> >>
> >> 1) Is there any change between 1.3 and 1.4 which could make this slower?
> >> 2) FWICT from reading the implicit reconcile code, the Mesos master sends
> >> back statuses for all active and pending tasks of the framework (which
> >> has 70k+ in our cluster right now) in one batch before yielding to any
> >> other messages. Has anyone thought about supporting some kind of
> >> "pagination", i.e., the master would only send back N status updates,
> >> then delay for S seconds, then send back the next batch of N updates,
> >> until all active tasks are handled? This is pretty much how Aurora
> >> triggers explicit reconciles against Mesos, and we don't see any issue
> >> when processing it this way.
> >>
> >> Thanks!
> >>
> >>
> >> --
> >> Cheers,
> >>
> >> Zhitao Li
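
For reference, the "pagination" idea in the quoted message (send N status updates, wait S seconds, send the next N) could look roughly like the sketch below. This is only an illustration of the batching pattern, not Mesos master code: `batches`, `paginated_reconcile`, the `send` callback and the injectable `sleep` are all hypothetical names made up for this example. A real master would presumably schedule the delay asynchronously rather than block, so that other messages can be handled between batches.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(tasks: Sequence[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most n task IDs."""
    if n <= 0:
        raise ValueError("n must be positive")
    for start in range(0, len(tasks), n):
        yield list(tasks[start:start + n])


def paginated_reconcile(
    tasks: Sequence[str],
    n: int,
    s: float,
    send: Callable[[List[str]], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send status updates in batches of n, pausing s seconds between batches.

    `send` stands in for whatever delivers a batch of status updates to the
    scheduler (hypothetical). Returns the number of batches sent.
    """
    sent_batches = 0
    for batch in batches(tasks, n):
        if sent_batches:
            # Pause only between batches, never before the first one.
            sleep(s)
        send(batch)
        sent_batches += 1
    return sent_batches


if __name__ == "__main__":
    # Tiny demonstration with a zero delay so it finishes immediately.
    task_ids = [f"task-{i}" for i in range(7)]
    paginated_reconcile(task_ids, n=3, s=0.0, send=print)
```

Whether a fixed delay is the right throttle is a separate question; the sketch only shows that the batching itself is cheap to express.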