Re: Metrics collection affected when libprocess queue builds up

Zameer Manji Mon, 19 Dec 2016 18:33:07 -0800

I believe Zhitao is referring to `/metrics/snapshot` returning a result
after 10-30 seconds.


I think in a typical environment, this will cause most metrics collection
tooling to time out. That leaves the operator with no visibility into the
system, which makes debugging or fighting the problem very hard.
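
A toy model of how a queue backlog could delay a metrics read, assuming (as
Zhitao guesses in the quoted message) that a gauge is evaluated by
dispatching to the owning process and waiting for the result. This is a
minimal sketch and not libprocess code: `Actor`, `gauge_via_mailbox`,
`gauge_direct` and `measure` are hypothetical names made up for
illustration.

```python
import queue
import threading
import time


class Actor:
    """Toy single-threaded actor: one FIFO mailbox, one worker thread.

    Hypothetical stand-in for a process with an event queue; it does not
    use or mirror any real libprocess API.
    """

    def __init__(self):
        self._mailbox = queue.Queue()
        self._handled = 0
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Handle mailbox entries strictly in arrival order.
        while True:
            fn = self._mailbox.get()
            if fn is None:  # sentinel enqueued by stop()
                break
            fn()

    def work(self, seconds):
        """Enqueue one unit of simulated work that takes `seconds`."""
        def handler():
            time.sleep(seconds)
            with self._lock:
                self._handled += 1
        self._mailbox.put(handler)

    def gauge_via_mailbox(self):
        """Read the counter by queueing behind all pending work."""
        result = queue.Queue(maxsize=1)

        def read():
            with self._lock:
                result.put(self._handled)
        self._mailbox.put(read)
        return result.get()  # blocks until the worker reaches `read`

    def gauge_direct(self):
        """Read the counter from the caller's thread, bypassing the mailbox."""
        with self._lock:
            return self._handled

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()


def measure(backlog, cost):
    """Return (direct_latency, queued_latency, queued_value) in seconds."""
    actor = Actor()
    for _ in range(backlog):
        actor.work(cost)

    start = time.perf_counter()
    actor.gauge_direct()
    direct_latency = time.perf_counter() - start

    start = time.perf_counter()
    queued_value = actor.gauge_via_mailbox()
    queued_latency = time.perf_counter() - start

    actor.stop()
    return direct_latency, queued_latency, queued_value
```

In this model the queued read waits for roughly `backlog * cost` seconds,
because it sits behind every pending item, while the direct read does not
depend on the backlog. Calling `measure(backlog=100, cost=0.01)` shows the
gap; the exact timings vary by machine.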

On Mon, Dec 19, 2016 at 9:23 PM, haosdent <haosd...@gmail.com> wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean that `/metrics/snapshot` returns a result after 10~30 seconds?
> Or that `/metrics/snapshot` takes 10~30 seconds to reflect a change in the
> `allocator/mesos/event_queue_dispatches` gauge?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build-up issue on master
> > (which I plan to share in another thread), I noticed that
> > `/metrics/snapshot` is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed
> > by the allocator/mesos/event_queue_dispatches gauge), the
> > `/metrics/snapshot` could take 10-30 seconds to respond.
> >
> > During active debugging or while fighting an outage, this is highly
> > undesirable.
> >
> > My guess is that much of the stats collection code relies on *deferring*
> > to another libprocess process and collecting the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
> --
> Zameer Manji
>
