The /metrics endpoint exposes a timeout parameter; when set, the response
contains whichever metrics were available within the timeout, e.g.
/metrics/snapshot.json?timeout=10secs

I'd recommend using this when collecting metrics so that you can maintain
visibility when a particular component is backlogged.

> Should we explore a more reliable way to track metrics independently from
> libprocess's queue?


Note that this problem applies only to our defer-based "Gauge" metrics,
which execute on the actor; Counters and Timers are immune to it. I would
say there are a few improvements we can make, in increasing order of
difficulty:

(1) There are instances of Gauges that might be better represented as
Counters. For example, we expose the actor queue sizes using a gauge (known
to be unfortunate!), when instead we could expose two counters for
"enqueued" and "dequeued" messages and infer size from these. We can also
add the ability for callers to manually increment and decrement their
Gauges rather than go through a dispatch.
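
To make (1) concrete, here's a rough, self-contained C++ sketch (not the
actual libprocess metrics API; the names are made up) of inferring a queue
size from two monotonically increasing counters, so reading it never has to
dispatch to the actor:

    // Illustrative only: two counters bumped at enqueue/dequeue time; the
    // collector derives the size as (enqueued - dequeued).
    #include <atomic>
    #include <cstdint>
    #include <iostream>

    struct QueueMetrics {
      std::atomic<uint64_t> enqueued{0};   // bumped when a message is queued
      std::atomic<uint64_t> dequeued{0};   // bumped when a message is processed

      // Derived value; reading it never touches the actor's queue.
      uint64_t size() const { return enqueued.load() - dequeued.load(); }
    };

    int main() {
      QueueMetrics metrics;
      metrics.enqueued += 3;   // e.g. three dispatches arrive
      metrics.dequeued += 1;   // one has been processed
      std::cout << "queue size: " << metrics.size() << std::endl;  // prints 2
      return 0;
    }

A manually adjusted Gauge would look much the same: the actor bumps an
atomic on its own and the endpoint just reads it.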

(2) Allow Gauge dispatches to be sent to the front of the actor's queue,
rather than the back. I would hope that we don't wind up with a notion of
integer priority for messages. Note that this doesn't solve the problem when
the "backlog" is occurring inside a single expensive function. It can also
starve normal messages of "progress" if metrics are polled frequently enough
and are expensive enough.
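
A rough sketch of what (2) means, using a plain std::deque rather than the
real libprocess mailbox (the Mailbox/dispatchMetrics names are invented for
illustration):

    #include <deque>
    #include <functional>
    #include <iostream>
    #include <utility>

    class Mailbox {
    public:
      // Normal dispatches go to the back of the queue.
      void dispatch(std::function<void()> f) { queue_.push_back(std::move(f)); }

      // Hypothetical "priority" path for metrics: handled before any queued
      // work, but still only between messages, so it cannot interrupt a
      // single long-running function.
      void dispatchMetrics(std::function<void()> f) { queue_.push_front(std::move(f)); }

      void runOne() {
        if (queue_.empty()) return;
        std::function<void()> f = std::move(queue_.front());
        queue_.pop_front();
        f();
      }

    private:
      std::deque<std::function<void()>> queue_;
    };

    int main() {
      Mailbox mailbox;
      mailbox.dispatch([] { std::cout << "expensive work\n"; });
      mailbox.dispatchMetrics([] { std::cout << "gauge read\n"; });
      mailbox.runOne();  // prints "gauge read" first
      mailbox.runOne();  // then "expensive work"
      return 0;
    }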

(3) There are instances of Gauges that might be better represented as
thread-safe logic. For example, if we need an actor's std::map member's
.size(), we could call .size() safely so long as the map is not destructed.
In other cases, explicit locking may be needed and is more complicated.
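
For the "explicit locking" flavor of (3), something like the following
(FrameworkTracker and frameworksGauge are made-up names, not Mesos code):

    #include <iostream>
    #include <map>
    #include <mutex>
    #include <string>

    class FrameworkTracker {
    public:
      void add(const std::string& id) {
        std::lock_guard<std::mutex> lock(mutex_);
        frameworks_[id] = true;
      }

      // The gauge reads the size under the same mutex the actor uses for
      // mutation, so the metrics endpoint never queues a dispatch.
      double frameworksGauge() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return static_cast<double>(frameworks_.size());
      }

    private:
      mutable std::mutex mutex_;
      std::map<std::string, bool> frameworks_;
    };

    int main() {
      FrameworkTracker tracker;
      tracker.add("framework-1");
      std::cout << tracker.frameworksGauge() << std::endl;  // prints 1
      return 0;
    }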

(4) There are instances of Gauges that might be better represented as a
"wrapping" around a data-structure. For example, the std::map could be
wrapped as a 'map_wrapper' that injects metric updates into each non-const
operation that affects the size of the map.
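
And a sketch of the wrapping idea in (4), with 'MetricsMap' standing in for
the hypothetical 'map_wrapper':

    #include <atomic>
    #include <cstddef>
    #include <iostream>
    #include <map>

    template <typename K, typename V>
    class MetricsMap {
    public:
      void insert(const K& key, const V& value) {
        if (map_.emplace(key, value).second) {
          ++size_;  // only bump the metric if a new key was actually added
        }
      }

      void erase(const K& key) {
        if (map_.erase(key) > 0) {
          --size_;
        }
      }

      // Lock-free read used by the gauge; no dispatch to the owning actor.
      std::size_t size() const { return size_.load(); }

    private:
      std::map<K, V> map_;
      std::atomic<std::size_t> size_{0};
    };

    int main() {
      MetricsMap<int, int> tasks;
      tasks.insert(1, 100);
      tasks.insert(2, 200);
      tasks.erase(1);
      std::cout << tasks.size() << std::endl;  // prints 1
      return 0;
    }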

So far I've felt that the timeout and (1) will be sufficient for the
foreseeable future, while (3) and (4) seem to add significant complexity to
non-metrics-related code. Let me know what you think.

Ben

On Mon, Dec 19, 2016 at 6:32 PM, Zameer Manji <zma...@apache.org> wrote:

> I believe Zhitao is referring to `/metrics/snapshot` returning a result
> after 10-30 seconds.
>
> I think in a typical environment, this will cause most metrics collection
> tooling to time out. This leaves the operator without any visibility
> into the system, making debugging/fighting the problem very hard.
>
> On Mon, Dec 19, 2016 at 9:23 PM, haosdent <haosd...@gmail.com> wrote:
>
> > Hi, @zhitao
> >
> > > the `/metrics/snapshot` could take 10-30 seconds to respond.
> >
> > Do you mean that `/metrics/snapshot` returns a result after 10~30 seconds?
> > Or that `/metrics/snapshot` takes 10~30 seconds to reflect a change in the
> > `allocator/mesos/event_queue_dispatches` gauge?
> >
> > On Mon, Dec 19, 2016 at 1:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > While I was debugging an allocator message queue build-up issue on
> > > master (which I plan to share in another thread), I noticed that
> > > `/metrics/snapshot` is also badly affected.
> > >
> > > For example, when the allocator queue has ~3k dispatches in it
> > > (revealed by the allocator/mesos/event_queue_dispatches gauge), the
> > > `/metrics/snapshot` could take 10-30 seconds to respond.
> > >
> > > During active debugging or outage fighting, this is pretty
> > > undesirable.
> > >
> > > My guess is that much of the stats collection code relies on
> > > *deferring* to another libprocess process and collecting the result.
> > >
> > > Should we explore a more reliable way to track metrics independently
> > > from libprocess's queue?
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> > >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
> > --
> > Zameer Manji
> >
>
