Which version of Mesos are you running?

> In framework, event updates grow up to 250k

What does this mean? The scheduler has 250k events in its queue?

> which leads to cascading effect on higher latency at Mesos Master (ack
> requests with 10s timeout)

Can you send us perf stacks of the master during such a time window so that
we can see if there are any bottlenecks?
http://mesos.apache.org/documentation/latest/performance-profiling/

Where is this timeout coming from and how is it used?

> simultaneously explore if dedup is an option

I don't know what you're referring to in terms of de-duplication. Can you
explain how the scheduler's status update processing works? Does it use
explicit acknowledgements and process batches asynchronously? Aurora
example: https://reviews.apache.org/r/33689/
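To make the de-duplication question concrete, here is a minimal sketch of one possible scheduler-side pattern. It queues status updates, drops retries of an update that is still queued, and explicitly acknowledges each distinct update once, after it has been processed, in asynchronous batches. It assumes each update carries a UUID that the agent reuses when it retries the same update. `StatusUpdate`, `DedupingAckQueue` and the `send_ack` callback (a stand-in for the scheduler's acknowledge call) are hypothetical names, not Mesos API.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StatusUpdate:
    # Hypothetical, simplified view of a status update event.
    task_id: str
    state: str
    uuid: str  # assumed identical across agent retries of the same update


class DedupingAckQueue:
    """Queues updates, drops retries already queued, acks each one once."""

    def __init__(self, send_ack: Callable[[StatusUpdate], None]):
        self._send_ack = send_ack  # hypothetical stand-in for the acknowledge call
        self._pending: "OrderedDict[str, StatusUpdate]" = OrderedDict()
        self.dropped = 0

    def offer(self, update: StatusUpdate) -> bool:
        """Enqueue an update; return False if the same UUID is already queued."""
        if update.uuid in self._pending:
            self.dropped += 1
            return False
        self._pending[update.uuid] = update
        return True

    def drain(self, process: Callable[[StatusUpdate], None], batch_size: int = 100) -> int:
        """Process up to batch_size updates in arrival order, acking each one."""
        handled = 0
        while self._pending and handled < batch_size:
            _, update = self._pending.popitem(last=False)  # oldest first
            process(update)         # persist the new state first...
            self._send_ack(update)  # ...then ack, so a crash here means a resend, not a loss
            handled += 1
        return handled


# Usage: a retry of "uuid-a" arrives while the original is still queued.
acks = []
queue = DedupingAckQueue(send_ack=acks.append)
first = StatusUpdate("task-1", "TASK_RUNNING", "uuid-a")
queue.offer(first)
queue.offer(first)  # same UUID: dropped instead of queued twice
queue.offer(StatusUpdate("task-2", "TASK_FINISHED", "uuid-b"))
queue.drain(process=lambda update: None)  # no-op processing for the example
```

A retry that arrives after its ack was already sent is queued and acknowledged again. Since duplicate acknowledgements are expected to be low cost, the sketch does not try to remember acked UUIDs.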

On Sun, Oct 28, 2018 at 8:58 PM Varun Gupta <var...@uber.com.invalid> wrote:

> Hi Benjamin,
>
> In our batch workload use case, task churn is pretty high. We have seen
> 20-30k tasks launch within a 10-second window, with 100k+ tasks running.
>
> In framework, event updates grow up to 250k, which leads to cascading
> effect on higher latency at Mesos Master (ack requests with 10s timeout) as
> well as blocking the framework from processing new updates, since there are
> too many left to be acknowledged.
>
> Reconciliation runs every 30 mins, which also adds pressure on the event
> stream if too many updates are unacknowledged.
>
> I would like to experiment with raising the default backoff period from 10s
> to 30s or 60s, and simultaneously explore if dedup is an option.
>
> Thanks,
> Varun
>
> On Sun, Oct 28, 2018 at 6:49 PM Benjamin Mahler <bmah...@apache.org> wrote:
>
> > Hi Varun,
> >
> > What problem are you trying to solve precisely? There seems to be an
> > implication that the duplicate acknowledgements are expensive. They should
> > be low cost, so that's rather surprising. Do you have any data related to
> > this?
> >
> > You can also tune the backoff rate on the agents, if the defaults are too
> > noisy in your setup.
> >
> > Ben
> >
> > On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta <var...@uber.com> wrote:
> >
> > > Hi,
> > >
> > > The Mesos agent will send status updates with exponential backoff until
> > > an ack is received.
> > >
> > > If processing of events at the framework and sending acks to the Master
> > > is running slow, then it builds back pressure at the framework due to
> > > duplicate updates for the same status.
> > >
> > > Has someone explored the option to dedup the same status update event at
> > > the framework, or is it even advisable to do so? The end goal is to dedup
> > > all events and send only one ack back to the Master.
> > >
> > > Thanks,
> > > Varun
> > >
> >
>
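The agent-side retry behaviour discussed in the thread (status updates resent with exponential backoff until acked, and the idea of raising the initial 10s period to 30s or 60s) can be sized with a quick calculation. The sketch below counts how many copies of one unacknowledged update an agent would send within a time window. It assumes the interval starts at the given value and doubles after each attempt, capped at a maximum. The doubling factor and the 600s cap used below are illustrative assumptions, not confirmed Mesos defaults.

```python
def sends_within(window_s: float, initial_s: float, cap_s: float) -> int:
    """Count transmissions of one unacknowledged status update in [0, window_s].

    Assumes (for illustration only) a first send at t=0 and a retry interval
    that starts at initial_s and doubles after every attempt, capped at cap_s.
    """
    t, interval, sends = 0.0, initial_s, 0
    while t <= window_s:
        sends += 1
        t += interval
        interval = min(interval * 2, cap_s)
    return sends


if __name__ == "__main__":
    # Hypothetical: an update left unacknowledged for two minutes, 600s cap.
    for initial in (10, 30, 60):
        print(initial, sends_within(120, initial, 600))  # 4, 3 and 2 sends
```

Multiplying the per-update count by the number of outstanding updates gives a rough figure for how many copies the scheduler would receive if none were acknowledged in that window. Raising the initial interval mainly removes the earliest retries.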
