I believe Zhitao is referring to `/metrics/snapshot` returning a result after 10-30 seconds.
I think in a typical environment, this will cause most metrics collection tooling to time out. That leaves the operator without any visibility into the system, which makes debugging or fighting the problem very hard.

On Mon, Dec 19, 2016 at 9:23 PM, haosdent <haosd...@gmail.com> wrote:

> Hi, @zhitao
>
> > the `/metrics/snapshot` could take 10-30 seconds to respond.
>
> Do you mean it `/metrics/snapshot` return result after 10~30 seconds?
> Or `/metrics/snapshot` takes 10~30 seconds to reflect the change of
> `allocator/mesos/event_queue_dispatches gauge`?
>
> On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi all,
> >
> > While I was debugging an allocator message queue build up issue on master
> > (which I plan to share another thread), I noticed that `/metrics/snapshot`
> > is also badly affected.
> >
> > For example, when the allocator queue has ~3k dispatches in it (revealed by
> > the allocator/mesos/event_queue_dispatches gauge), the `/metrics/snapshot`
> > could take 10-30 seconds to respond.
> >
> > During an active debugging or outage fighting, this is pretty undesired.
> >
> > My guess is that many stats collection code relies on *deferring* to
> > another libprocess and collect the result.
> >
> > Should we explore a more reliable way to track metrics independently from
> > libprocess's queue?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
>
> --
> Best Regards,
> Haosdent Huang

--
Zameer Manji
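
To make the timeout concern above concrete, here is a minimal sketch of how a collector might poll `/metrics/snapshot`. It is only an illustration: the function name `fetch_snapshot` and the 5-second default are assumptions for this example, not taken from any particular monitoring tool.

```python
import json
import urllib.request


def fetch_snapshot(base_url, timeout_s=5.0):
    """Return the parsed metrics snapshot, or None if the request fails.

    `timeout_s` is a hypothetical collector timeout; real tools differ.
    """
    url = base_url.rstrip("/") + "/metrics/snapshot"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except OSError:
        # URLError and socket timeouts are both OSError subclasses, so a
        # slow or unreachable master ends up here and yields no data.
        return None
```

With a timeout like this, a master that needs 10-30 seconds to answer returns `None` on every poll, so the dashboards go empty at exactly the moment the allocator queue is backed up.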