Hi all,

While I was debugging an allocator message queue buildup issue on the master (which I plan to share in another thread), I noticed that `/metrics/snapshot` is also badly affected.
For example, when the allocator queue has ~3k dispatches in it (as shown by the `allocator/mesos/event_queue_dispatches` gauge), `/metrics/snapshot` can take 10-30 seconds to respond. During active debugging or while fighting an outage, this is highly undesirable.

My guess is that much of the stats collection code relies on *deferring* to another libprocess process and collecting the result, so the metrics request waits behind whatever is already queued there. Should we explore a more reliable way to track metrics independently of libprocess's queue?

--
Cheers,
Zhitao Li
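
To illustrate the guess above, here is a minimal toy sketch. It is not Mesos or libprocess code: `ToyActor`, `snapshot_via_dispatch`, and `snapshot_direct` are hypothetical names, and the single-threaded FIFO mailbox is only an assumed model of how a deferred gauge might behave. It compares a gauge that is evaluated by dispatching onto the actor's queue with one that is read directly.

```python
from collections import deque


class ToyActor:
    """Hypothetical single-threaded actor: events run strictly in FIFO order."""

    def __init__(self):
        self.mailbox = deque()

    def dispatch(self, fn):
        # Enqueue work; it only runs after everything queued before it.
        self.mailbox.append(fn)

    def run_one(self):
        # Process exactly one queued event.
        self.mailbox.popleft()()


def snapshot_via_dispatch(actor):
    """Evaluate a gauge on the actor's own queue.

    Returns the number of events the actor had to process before the
    gauge value became available.
    """
    result = {}
    actor.dispatch(lambda: result.setdefault("value", len(actor.mailbox)))
    steps = 0
    while "value" not in result:
        actor.run_one()
        steps += 1
    return steps


def snapshot_direct(actor):
    """Read the gauge without going through the queue.

    In a real multi-threaded system this read would need an atomic or a
    lock; the toy model here is single-threaded.
    """
    return len(actor.mailbox)


backlogged = ToyActor()
for _ in range(3000):
    backlogged.dispatch(lambda: None)  # stand-in for pending allocator work

print(snapshot_direct(backlogged))        # 3000, with no events processed
print(snapshot_via_dispatch(backlogged))  # 3001 events processed first
```

In this model the dispatched gauge only answers after the whole backlog drains, while the direct read costs the same regardless of queue length. Whether the real endpoint behaves this way would need to be confirmed against the actual code.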