Hi, I'd like to start a conversation about the behavior of the metrics collection endpoints (especially `/metrics/snapshot`).

Right now, these endpoints are served from the same libprocess as the master/agent, and they make extensive use of `gauge`, which chains further callbacks to collect various metrics (the DRF allocator in particular adds several metrics per role).

This becomes a problem when the system is under load: when the master/allocator libprocess is busy, stats collection itself becomes slow too. Flying blind while the system is under load is especially painful for an operator.

I would like to explore ways to isolate metrics collection so that it stays responsive even when the master is slow. A couple of ideas:

- (short term) reduce usage of `gauge` and prefer `counter` (since I believe counters are less affected);
- an alternative implementation of `gauge` which does not contend on the master/allocator's event queue;
- serving metrics collection from a different libprocess routine.

Any thoughts on these?

--
Cheers,
Zhitao Li