Which version of Mesos are you running?

> In framework, event updates grow up to 250k

What does this mean? The scheduler has 250k events in its queue?

> which leads to cascading effect on higher latency at Mesos Master (ack
> requests with 10s timeout)

Can you send us perf stacks of the master during such a time window so that
we can see if there are any bottlenecks?
http://mesos.apache.org/documentation/latest/performance-profiling/

Where is this timeout coming from and how is it used?

> simultaneously explore if dedup is an option

I don't know what you're referring to in terms of de-duplication. Can you
explain how the scheduler's status update processing works? Does it use
explicit acknowledgements and process batches asynchronously? Aurora
example: https://reviews.apache.org/r/33689/

On Sun, Oct 28, 2018 at 8:58 PM Varun Gupta <var...@uber.com.invalid> wrote:

> Hi Benjamin,
>
> In our batch workload use case, number of tasks churn is pretty high. We
> have seen 20-30k tasks launch within 10 second window and 100k+ tasks
> running.
>
> In framework, event updates grow up to 250k, which leads to cascading
> effect on higher latency at Mesos Master (ack requests with 10s timeout) as
> well as blocking framework to process new since there are too many left to
> be acknowledged.
>
> Reconciliation is every 30 mins which also adds pressure on event stream if
> too many unacknowledged.
>
> I am thinking to experiment with default backoff period from 10s -> 30s or
> 60s, and simultaneously explore if dedup is an option.
>
> Thanks,
> Varun
>
> On Sun, Oct 28, 2018 at 6:49 PM Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > Hi Varun,
> >
> > What problem are you trying to solve precisely? There seems to be an
> > implication that the duplicate acknowledgements are expensive. They
> > should be low cost, so that's rather surprising. Do you have any data
> > related to this?
> >
> > You can also tune the backoff rate on the agents, if the defaults are
> > too noisy in your setup.
> >
> > Ben
> >
> > On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta <var...@uber.com> wrote:
> >
> > > Hi,
> > >
> > > Mesos agent will send status updates with exponential backoff until
> > > ack is received.
> > >
> > > If processing of events at framework and sending ack to Master is
> > > running slow then it builds a back pressure at framework due to
> > > duplicate updates for same status.
> > >
> > > Has someone explored the option to dedup same status update event at
> > > framework or is it even advisable to do. End goal is to dedup all
> > > events and send only one ack back to Master.
> > >
> > > Thanks,
> > > Varun
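
For reference, the framework-side de-duplication asked about in the original message could be sketched roughly as below. This is only an illustration in plain Python, not Mesos or Aurora code: `StatusUpdate`, `DedupBuffer`, `offer`, and `drain` are hypothetical names, and the sketch assumes that a retried update carries the same task ID and UUID as the first delivery, so that repeats can be detected by that pair.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StatusUpdate:
    """Hypothetical stand-in for a status update event (not the Mesos type)."""
    agent_id: str
    task_id: str
    uuid: str   # assumption: a retry of the same update reuses this value
    state: str


class DedupBuffer:
    """Keeps at most one pending copy of each update, in arrival order."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[Tuple[str, str], StatusUpdate]" = OrderedDict()
        self.duplicates_dropped = 0

    def offer(self, update: StatusUpdate) -> bool:
        """Queue an update; return False if the same update is already queued."""
        key = (update.task_id, update.uuid)
        if key in self._pending:
            self.duplicates_dropped += 1
            return False
        self._pending[key] = update
        return True

    def drain(self, max_batch: int) -> List[StatusUpdate]:
        """Remove and return up to max_batch updates, oldest first."""
        batch: List[StatusUpdate] = []
        while self._pending and len(batch) < max_batch:
            _, update = self._pending.popitem(last=False)
            batch.append(update)
        return batch

    def __len__(self) -> int:
        return len(self._pending)


buf = DedupBuffer()
first = StatusUpdate("agent-1", "task-1", "uuid-a", "TASK_RUNNING")
buf.offer(first)   # queued
buf.offer(first)   # retry of the same update while still queued: dropped
buf.offer(StatusUpdate("agent-1", "task-2", "uuid-b", "TASK_RUNNING"))

# The scheduler would process this batch and then acknowledge each update once.
batch = buf.drain(max_batch=100)
```

Note that this only collapses retries that arrive while the first copy is still queued. A retry that arrives after the batch has been drained would be queued again and acknowledged a second time, which is the duplicate acknowledgement described above as low cost.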