I'd like to start a conversation about the behavior of the metrics
collection endpoints (especially `/metrics/snapshot`).

Right now, these endpoints are served from the master/agent's own
libprocess, and they make extensive use of `gauge` to chain further
callbacks that collect various metrics (the DRF allocator in particular
adds several metrics per role).

This causes a problem when the system is under load: when the
master/allocator libprocess becomes busy, stats collection itself becomes
slow too. Flying dark while the system is under load is particularly
painful for an operator.
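
To make the contention concrete, here is a minimal sketch using only the C++ standard library. It is not libprocess code: `Actor`, `QueuedGauge` and `Counter` are hypothetical names, the actor is a toy single FIFO queue, and I am assuming a counter boils down to an atomic read. It only illustrates why a gauge read that is dispatched onto a busy queue has to wait, while a counter read does not.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <memory>
#include <optional>
#include <utility>

// Toy stand-in for an actor (hypothetical, not libprocess): one FIFO
// queue, events run one at a time in arrival order.
class Actor {
public:
  void dispatch(std::function<void()> event) {
    queue_.push_back(std::move(event));
  }

  // Runs the oldest queued event; returns false if the queue was empty.
  bool runOne() {
    if (queue_.empty()) return false;
    std::function<void()> event = std::move(queue_.front());
    queue_.pop_front();
    event();
    return true;
  }

  std::size_t pending() const { return queue_.size(); }

private:
  std::deque<std::function<void()>> queue_;
};

// Gauge whose value is computed by a callback queued on the owning actor.
class QueuedGauge {
public:
  QueuedGauge(Actor& owner, std::function<double()> read)
    : owner_(owner), read_(std::move(read)) {}

  // The returned slot stays empty until the actor reaches the callback.
  std::shared_ptr<std::optional<double>> value() {
    auto slot = std::make_shared<std::optional<double>>();
    owner_.dispatch([slot, read = read_] { *slot = read(); });
    return slot;
  }

private:
  Actor& owner_;
  std::function<double()> read_;
};

// Counter read straight from an atomic, without touching any queue.
class Counter {
public:
  void increment() { value_.fetch_add(1, std::memory_order_relaxed); }
  std::uint64_t value() const { return value_.load(std::memory_order_relaxed); }

private:
  std::atomic<std::uint64_t> value_{0};
};
```

If the queue already holds N events when the gauge is read, the gauge callback runs only after all N have been processed, so its latency grows with the backlog. The counter read returns immediately regardless of the backlog.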

I would like to explore ways of isolating metric collection so that it
stays responsive even when the master is slow. A couple of ideas:

- (short term) reduce usage of `gauge` and prefer `counter` (since I
believe counters are less affected);
- an alternative implementation of `gauge` that does not contend on the
master/allocator's event queue;
- serving metrics collection from a different libprocess routine.
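
As one possible shape for the second idea, here is a minimal sketch in standard C++. It is not an existing libprocess API: `PublishedGauge`, `Registry` and `snapshot` are hypothetical names. The owner publishes the current value into an atomic whenever its state changes, and the snapshot path only loads that atomic, so it never enqueues anything on the owner's event queue.

```cpp
#include <atomic>
#include <map>
#include <string>

// Hypothetical gauge: the owner pushes values in, readers load them.
class PublishedGauge {
public:
  // Called by the owner whenever the underlying state changes.
  void set(double v) { value_.store(v, std::memory_order_relaxed); }

  // Called by the metrics endpoint; never waits on the owner.
  double value() const { return value_.load(std::memory_order_relaxed); }

private:
  std::atomic<double> value_{0.0};
};

// Hypothetical registry mapping metric names to gauges owned elsewhere.
using Registry = std::map<std::string, const PublishedGauge*>;

// Builds a snapshot by loading each gauge's last published value.
std::map<std::string, double> snapshot(const Registry& gauges) {
  std::map<std::string, double> result;
  for (const auto& [name, gauge] : gauges) {
    result[name] = gauge->value();
  }
  return result;
}
```

The trade-off is that the owner has to remember to call `set` on every relevant state change, and a snapshot reports the last published value rather than one computed at request time.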

Any thoughts on these?


Zhitao Li
