This is great; I would love to see metrics collection be low latency. Here is the response I gave last time, with the 4 options I think we should tackle: https://www.mail-archive.com/dev@mesos.apache.org/msg37113.html
IMHO we should defer looking into the 5th option of sampling/caching internally until we've thoroughly attempted to avoid tripping through the Process queue, because if we succeed it will become unnecessary.

On Mon, May 22, 2017 at 1:26 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Thanks for the feedback, James.
>
> Replying to your points inline:
>
> On Mon, May 22, 2017 at 10:56 AM, James Peach <jor...@gmail.com> wrote:
>
> > On May 19, 2017, at 11:35 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'd like to start a conversation about the behavior of the metrics collection endpoints (especially `/metrics/snapshot`).
> > >
> > > Right now, these endpoints are served from the same master/agent libprocess, and extensively use `gauge` to chain further callbacks to collect various metrics (the DRF allocator specifically adds several metrics per role).
> > >
> > > This brings a problem when the system is under load: when the master/allocator libprocess becomes busy, stats collection itself becomes slow too. Flying dark when the system is under load is especially painful for an operator.
> >
> > Yes, sampling metrics should approach zero cost.
> >
> > > I would like to explore the direction of isolating metric collection even when the master is slow. A couple of ideas:
> > >
> > > - (short term) reduce usage of gauge and prefer counter (since I believe they are less affected);
> >
> > I'd rather not squash the semantics for performance reasons. If a metric has gauge semantics, I don't think we should represent that as a Counter.
>
> I recall that I had a previous conversation with @bmahler and he thought that certain gauges could be expressed as the differential of two counters.
>
> We definitely cannot express a gauge as a Counter, because a gauge's value can decrease, while a counter should always be treated as monotonically increasing until the process restarts.
>
> > > - alternative implementation of `gauge` which does not contend on the master/allocator's event queue;
> >
> > This is doable in some circumstances, but not always. For example, Master::_uptime_secs() doesn't need to run on the master queue, but Master::_outstanding_offers arguably does. The latter could be implemented by sampling a variable that is updated, but that's not very generic, so we should try to think of something better.
>
> I agree that this will not be a trivial cut. The fact that this is not something trivially achievable is the primary reason I want to start this conversation with the general dev community. We can absorb some work to optimize certain hot paths (I suspect the role-specific ones in the allocator are among them for us), but maintaining this in the long term will definitely require help from all contributors.
>
> w.r.t. examples, it seems that Master::_outstanding_offers simply calls hashmap::size() on a hashmap object, so if the underlying container type conforms to C++11's thread safety requirements (http://en.cppreference.com/w/cpp/container#Thread_safety), we should at least be able to call size() with the understanding that we might get a slightly stale value?
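As an illustration of the "sampling a variable that is updated" idea, here is a rough standalone sketch (this is not the actual libprocess Gauge API, and the names are made up): the owning actor updates an atomic as part of its normal state changes, and the metrics path only reads that atomic, so sampling never dispatches onto the master's queue and at worst observes a slightly stale value.

// Standalone sketch (not the actual libprocess API): a gauge whose value is
// kept in an atomic that the owning actor updates on its own queue, so that
// sampling it never has to dispatch onto that queue.
#include <atomic>
#include <cstdint>
#include <iostream>

class SampledGauge
{
public:
  // Reading is wait-free and can safely happen from the metrics endpoint,
  // with the understanding that the value may be slightly stale.
  int64_t value() const { return value_.load(std::memory_order_relaxed); }

  // The owner (e.g. the master, while already running on its own event
  // queue) updates the gauge as part of the state change itself.
  void increment() { value_.fetch_add(1, std::memory_order_relaxed); }
  void decrement() { value_.fetch_sub(1, std::memory_order_relaxed); }

private:
  std::atomic<int64_t> value_{0};
};

int main()
{
  SampledGauge outstandingOffers;

  // Hypothetical call sites: the master would bump the gauge when it sends
  // an offer and drop it when the offer is accepted, declined or rescinded.
  outstandingOffers.increment();
  outstandingOffers.increment();
  outstandingOffers.decrement();

  std::cout << "master/outstanding_offers = " << outstandingOffers.value()
            << std::endl;
  return 0;
}

The trade-off is the same one noted for hashmap::size(): a reader may see a value that is one update behind, which seems acceptable for metrics.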
> I think a more interesting example is Master::_task_starting(): not only is it not calculated by a simple const method, but the result is actually generated by iterating over all tasks registered to the master. This means the cost of calculating it is linear in the number of tasks in the cluster.
>
> > > - serving metrics collection from a different libprocess routine.
> >
> > See MetricsProcess. One (mitigation?) approach would be to sample the metrics at a fixed rate and then serve the cached samples from the MetricsProcess. I expect most installations have multiple clients sampling the metrics, so this would at least decouple the sample rate from the metrics request rate.
>
> That sounds like a very good idea to start with. I still think certain code should be augmented so that it maintains the gauge value more efficiently, but for code which is harder to rewrite this can definitely improve the situation.
>
> > > Any thoughts on these?
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
>
> --
> Cheers,
>
> Zhitao Li
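To make the fixed-rate sampling / cached snapshot idea concrete, below is a rough standalone sketch (not the actual MetricsProcess; all names here are invented): a background loop invokes the potentially expensive gauge callbacks at a fixed interval, and snapshot requests are served from the cached copy, so the sample rate is decoupled from the request rate.

// Standalone sketch (not the actual MetricsProcess) of decoupling the
// sample rate from the request rate: a background loop samples the
// (potentially expensive) gauges at a fixed interval, and
// /metrics/snapshot-style requests are served from the cached copy.
#include <atomic>
#include <chrono>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

class CachedMetrics
{
public:
  explicit CachedMetrics(std::chrono::milliseconds interval)
    : interval_(interval), running_(true), sampler_([this]() { loop(); }) {}

  ~CachedMetrics()
  {
    running_ = false;
    sampler_.join();
  }

  // Registering a gauge only stores the (possibly expensive) callback;
  // it is invoked by the sampling loop, never by the HTTP request path.
  void add(const std::string& name, std::function<double()> gauge)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    gauges_[name] = std::move(gauge);
  }

  // What an endpoint handler would call: returns the last sampled values.
  std::map<std::string, double> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;
  }

private:
  void loop()
  {
    while (running_) {
      std::map<std::string, std::function<double()>> gauges;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        gauges = gauges_;
      }

      std::map<std::string, double> fresh;
      for (const auto& entry : gauges) {
        fresh[entry.first] = entry.second(); // The expensive part.
      }

      {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_ = std::move(fresh);
      }

      std::this_thread::sleep_for(interval_);
    }
  }

  const std::chrono::milliseconds interval_;
  mutable std::mutex mutex_;
  std::map<std::string, std::function<double()>> gauges_;
  std::map<std::string, double> cache_;
  std::atomic<bool> running_;
  std::thread sampler_;
};

int main()
{
  CachedMetrics metrics(std::chrono::milliseconds(100));

  // Hypothetical expensive gauge, standing in for something like
  // master/tasks_starting that walks all tasks registered to the master.
  metrics.add("master/tasks_starting", []() { return 42.0; });

  std::this_thread::sleep_for(std::chrono::milliseconds(250));

  for (const auto& entry : metrics.snapshot()) {
    std::cout << entry.first << " = " << entry.second << std::endl;
  }
  return 0;
}

With something along these lines, many clients polling /metrics/snapshot would not each pay the cost of walking all tasks for a gauge like the one above; they would simply read the most recent cached sample.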