Metrics collection affected when libprocess queue builds up

Zhitao Li, Sun, 18 Dec 2016 21:12:04 -0800

Hi all,

While debugging an allocator message queue build-up issue on the master
(which I plan to share in another thread), I noticed that `/metrics/snapshot`
is also badly affected.


For example, when the allocator queue has ~3k dispatches in it (as reported by
the allocator/mesos/event_queue_dispatches gauge), `/metrics/snapshot`
can take 10-30 seconds to respond.

During active debugging or while fighting an outage, this is highly undesirable.

My guess is that much of the stats collection code relies on *deferring* to
another libprocess process and collecting the result.
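
To illustrate that guess, here is a toy model in Python. It is not libprocess
and uses none of its APIs; `Actor` and `events_before_gauge_answers` are
hypothetical names, and the only assumption is that an actor handles its
mailbox strictly in FIFO order. A gauge read that is deferred to such an actor
has to wait behind everything already queued:

```python
from collections import deque


class Actor:
    """Toy single-threaded actor: queued events run strictly in FIFO order."""

    def __init__(self):
        self.mailbox = deque()
        self.processed = 0  # number of events fully handled so far

    def dispatch(self, fn):
        # Enqueue work; nothing runs until the actor gets to it.
        self.mailbox.append(fn)

    def run_one(self):
        fn = self.mailbox.popleft()
        fn()
        self.processed += 1


def events_before_gauge_answers(backlog):
    """Return how many events the actor handled before the gauge read ran."""
    actor = Actor()
    for _ in range(backlog):
        actor.dispatch(lambda: None)  # stand-in for queued allocator work

    result = []
    # The gauge read is itself just another event at the back of the queue.
    actor.dispatch(lambda: result.append(actor.processed))

    while not result:
        actor.run_one()
    return result[0]
```

In this model, a backlog of 3000 events means the gauge read only runs after
all 3000 have been handled, so its latency grows with the queue length.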

Should we explore a more reliable way to track metrics independently from
libprocess's queue?
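
One possible shape for this, sketched in Python as a toy model rather than a
proposal for the actual code: the gauge value lives outside the actor and is
updated on enqueue and dequeue, so a reader never has to go through the
mailbox. `QueueDepthGauge` and `InstrumentedActor` are hypothetical names, not
existing Mesos or libprocess classes.

```python
import threading
from collections import deque


class QueueDepthGauge:
    """Hypothetical gauge that is read directly, without queuing behind work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def add(self, delta):
        with self._lock:
            self._value += delta

    def value(self):
        with self._lock:
            return self._value


class InstrumentedActor:
    """Toy FIFO actor that keeps the gauge up to date as events come and go."""

    def __init__(self, depth):
        self.mailbox = deque()
        self.depth = depth

    def dispatch(self, fn):
        self.mailbox.append(fn)
        self.depth.add(1)

    def run_one(self):
        fn = self.mailbox.popleft()
        self.depth.add(-1)
        fn()
```

A metrics endpoint could then call `depth.value()` even while the actor is
thousands of events behind. The trade-off is that every enqueue and dequeue
pays for the extra update.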

-- 
Cheers,

Zhitao Li
