[API WG] Meeting cancelled - Oct. 30

2018-10-29 Thread Greg Mann
Hi all,
We currently have no agenda for the meeting tomorrow, and I'll be unable to
attend. For these reasons, I'd like to cancel this one. Our next meeting is
scheduled for Nov. 13 - see you then!

Cheers,
Greg


Re: Dedup mesos agent status updates at framework

2018-10-29 Thread Benjamin Mahler
The timeout behavior sounds like a dangerous scalability tripwire. Consider
revisiting that approach.

On Sun, Oct 28, 2018 at 10:42 PM Varun Gupta wrote:

> Mesos Version: 1.6
>
> Yes, the scheduler has 250k events in its queue: the Mesos Master sends
> status updates to the scheduler, which stores them in the queue. The
> scheduler processes them in FIFO order, and once an update is processed
> (which includes persisting it to the DB) it acks the update. These updates
> are processed asynchronously with a thread pool of size 1000. We are using
> explicit reconciliation.
> If an ack to the Mesos Master times out due to high CPU usage, the next
> ack will likely fail too. This slows down processing on the scheduler side,
> while the Mesos Master continues to send status updates (duplicate ones,
> since the old status updates were not acked). This leads to a buildup of
> status updates to be processed at the scheduler, which we have seen grow
> up to 250k.
>
> The timeout is on the explicit ack request from the scheduler to the
> Mesos Master.
>
> Mesos Master profiling: the next time this issue occurs, I will take the
> dump.
>
> Deduplication applies to the status updates pending in the scheduler's
> queue: the idea is to dedup duplicate status updates so that the scheduler
> processes each distinct status update pending in the queue only once, and
> acks it to the Mesos Master only once. This would reduce the load on both
> the scheduler and the Mesos Master. After the ack (success or failure) the
> scheduler removes the status update from the queue, and in case of failure
> the Mesos Master will send the status update again.
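[Editor's note: to make the proposal above concrete, here is a minimal sketch of the dedup idea, assuming each update carries a task id and a uuid. The names and structure are illustrative only, not the Mesos scheduler API: duplicates retried by the agent are coalesced so each distinct update is processed and acked once.]

```python
# Hypothetical sketch: a FIFO queue that drops duplicate status updates,
# keyed on (task_id, uuid), so the scheduler processes and acks each
# distinct update exactly once. Not real Mesos API types.
from collections import OrderedDict

class DedupQueue:
    def __init__(self):
        # (task_id, uuid) -> update, in arrival (FIFO) order
        self._pending = OrderedDict()

    def push(self, update):
        key = (update["task_id"], update["uuid"])
        if key in self._pending:
            return False  # duplicate retry from the agent; drop it
        self._pending[key] = update
        return True

    def pop(self):
        # Oldest distinct update first; the scheduler would process it,
        # persist it, and then ack it to the Master once.
        _, update = self._pending.popitem(last=False)
        return update

q = DedupQueue()
q.push({"task_id": "t1", "uuid": "u1", "state": "TASK_RUNNING"})
q.push({"task_id": "t1", "uuid": "u1", "state": "TASK_RUNNING"})  # dropped
```

A failed ack would simply leave the agent retrying, and the retried update re-enters the queue after it has been popped, matching the "Master will send the status update again" behavior described above.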
>
>
>
> On Sun, Oct 28, 2018 at 10:15 PM Benjamin Mahler wrote:
>
> > Which version of mesos are you running?
> >
> > > In framework, event updates grow up to 250k
> >
> > What does this mean? The scheduler has 250k events in its queue?
> >
> > > which leads to cascading effect on higher latency at Mesos Master (ack
> > requests with 10s timeout)
> >
> > Can you send us perf stacks of the master during such a time window so
> > that we can see if there are any bottlenecks?
> > http://mesos.apache.org/documentation/latest/performance-profiling/
> >
> > Where is this timeout coming from and how is it used?
> >
> > > simultaneously explore if dedup is an option
> >
> > I don't know what you're referring to in terms of de-duplication. Can you
> > explain how the scheduler's status update processing works? Does it use
> > explicit acknowledgements and process batches asynchronously? Aurora
> > example: https://reviews.apache.org/r/33689/
> >
> > On Sun, Oct 28, 2018 at 8:58 PM Varun Gupta wrote:
> >
> >> Hi Benjamin,
> >>
> >> In our batch workload use case, task churn is pretty high. We have seen
> >> 20-30k tasks launch within a 10 second window and 100k+ tasks running.
> >>
> >> In the framework, event updates grow up to 250k, which has a cascading
> >> effect: higher latency at the Mesos Master (ack requests with a 10s
> >> timeout) as well as blocking the framework from processing new updates,
> >> since too many are left to be acknowledged.
> >>
> >> Reconciliation runs every 30 mins, which also adds pressure on the event
> >> stream if too many updates are unacknowledged.
> >>
> >> I am thinking of experimenting with raising the default backoff period
> >> from 10s to 30s or 60s, and simultaneously exploring whether dedup is an
> >> option.
> >>
> >> Thanks,
> >> Varun
> >>
> >> On Sun, Oct 28, 2018 at 6:49 PM Benjamin Mahler wrote:
> >>
> >> > Hi Varun,
> >> >
> >> > What problem are you trying to solve precisely? There seems to be an
> >> > implication that the duplicate acknowledgements are expensive. They
> >> > should be low cost, so that's rather surprising. Do you have any data
> >> > related to this?
> >> >
> >> > You can also tune the backoff rate on the agents, if the defaults are
> >> > too noisy in your setup.
> >> >
> >> > Ben
> >> >
> >> > On Sun, Oct 28, 2018 at 4:51 PM Varun Gupta wrote:
> >> >
> >> > >
> >> > > Hi,
> >> > >>
> >> > >> The Mesos agent will send status updates with exponential backoff
> >> > >> until an ack is received.
> >> > >>
> >> > >> If processing events at the framework and sending acks to the
> >> > >> Master runs slow, it builds up back pressure at the framework due
> >> > >> to duplicate updates for the same status.
> >> > >>
> >> > >> Has anyone explored the option of deduping the same status update
> >> > >> event at the framework, and is it even advisable to do so? The end
> >> > >> goal is to dedup all duplicate events and send only one ack back
> >> > >> to the Master.
> >> > >>
> >> > >> Thanks,
> >> > >> Varun
> >> > >>
> >> > >>
> >> > >>
> >> >
> >>
> >
>
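[Editor's note: for context on the backoff tuning discussed in this thread (raising the default from 10s to 30s or 60s), here is a small sketch of an exponential backoff retry schedule. The 10s initial interval matches the default mentioned by Varun; the 10-minute cap is an assumption about the agent's maximum retry interval, not a value confirmed in the thread.]

```python
# Sketch of the agent-side retry schedule: each unacknowledged status
# update is re-sent with a doubling delay, capped at a maximum interval.
def retry_delays(initial=10.0, cap=600.0, attempts=6):
    """Return the first `attempts` retry delays, in seconds."""
    delays = []
    d = initial
    for _ in range(attempts):
        delays.append(min(d, cap))
        d *= 2
    return delays

print(retry_delays())  # → [10.0, 20.0, 40.0, 80.0, 160.0, 320.0]
```

Raising `initial` from 10s to 30s or 60s, as proposed, stretches the whole schedule and reduces how many duplicate updates a slow scheduler receives before it manages to ack, at the cost of slower recovery when an ack is genuinely lost.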