[
https://issues.apache.org/jira/browse/MESOS-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908718#comment-13908718
]
Dominic Hamon commented on MESOS-1028:
--------------------------------------
Instead of reporting percentiles, it might be better to report the entire
distribution. The consumer of the endpoint can then do its own processing
to derive percentiles or averages as needed.
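
As a rough sketch of what reporting the distribution could look like: the
endpoint publishes raw bucket counts and the consumer derives percentiles
from them. Everything here (the Histogram class, the bucket boundaries, the
snapshot layout) is a hypothetical illustration, not existing Mesos code.

```python
from bisect import bisect_left


class Histogram:
    """Fixed-bucket histogram of durations in milliseconds (hypothetical)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end counts values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def record(self, value):
        # A value lands in the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def snapshot(self):
        # This is what the endpoint would publish: the whole distribution.
        return {"upper_bounds": list(self.upper_bounds),
                "counts": list(self.counts)}


def percentile(snapshot, p):
    """Consumer side: upper bound of the bucket holding the p-th percentile.

    p is an integer from 1 to 100. Returns None when the histogram is empty
    or the percentile falls in the overflow bucket.
    """
    counts = snapshot["counts"]
    bounds = snapshot["upper_bounds"]
    total = sum(counts)
    if total == 0:
        return None
    # Nearest-rank method, computed with integer ceiling division.
    rank = max(1, -(-p * total // 100))
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= rank:
            return bounds[i] if i < len(bounds) else None
    return None


# Example: time taken to destroy a cgroup, in ms (made-up sample values).
destroy_ms = Histogram([10, 50, 100, 500])
for sample in (5, 7, 20, 30, 40, 80, 90, 120, 300, 1000):
    destroy_ms.record(sample)
print(percentile(destroy_ms.snapshot(), 50))
```

The trade-off is resolution: the consumer only learns which bucket a
percentile falls in, so the bucket boundaries need to be chosen with the
expected latencies in mind.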
Building a central service for statistics gathering and publication seems like
a reasonable approach. The statistics could be both global and per
master/slave/framework.
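
One possible shape for such a central service, sketched under assumptions:
counters are keyed by a scope string plus a name, and scoped updates also
roll up into a global total. The StatsRegistry class, the scope naming
("framework/f1") and the roll-up rule are all hypothetical.

```python
from collections import defaultdict

GLOBAL = "global"


class StatsRegistry:
    """Central counter store keyed by (scope, name). Hypothetical API."""

    def __init__(self):
        self._counters = defaultdict(int)

    def increment(self, name, scope=GLOBAL, by=1):
        self._counters[(scope, name)] += by
        # Assumption: every scoped update also counts toward the global total.
        if scope != GLOBAL:
            self._counters[(GLOBAL, name)] += by

    def get(self, name, scope=GLOBAL):
        # Unknown counters read as 0 without creating an entry.
        return self._counters.get((scope, name), 0)

    def snapshot(self):
        """Return {scope: {name: value}}, ready for publication."""
        result = {}
        for (scope, name), value in self._counters.items():
            result.setdefault(scope, {})[name] = value
        return result


registry = StatsRegistry()
registry.increment("tasks_launched", scope="framework/f1")
registry.increment("tasks_launched", scope="framework/f1")
registry.increment("tasks_launched", scope="framework/f2")
print(registry.snapshot())
```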
> expose internal metrics
> -----------------------
>
> Key: MESOS-1028
> URL: https://issues.apache.org/jira/browse/MESOS-1028
> Project: Mesos
> Issue Type: Improvement
> Components: general
> Reporter: David Robinson
>
> Mesos should export statistics that provide visibility into its internals.
> This would allow users to detect numerous problems without resorting to
> trawling log files.
> E.g. export counters of (some of these already exist, most don't):
>   cgroup create
>   cgroup destroy
>   cgroup destroy attempts
>   resource offers made
>   resource offers accepted
>   tasks launched
>   tasks destroyed
>   tasks lost
>   writes to replicated log
>   queue length
> export 50th, 90th, 95th, 99th percentile of time taken to:
>   start Mesos (reach a certain state)
>   move tasks between two given states (starting -> started)
>   create a cgroup
>   destroy a cgroup
>   send a message from slave to master
>   start a task
>   stop a task
>   register in ZooKeeper
>   write to the replicated log
> Ideally all these metrics would be exposed via an HTTP+JSON endpoint. See
> [metrics|http://metrics.codahale.com/getting-started/] for an example (albeit
> Java) library (or [medida|http://dln.github.io/medida/] for an
> unmaintained(?) C++ port)
> We've previously seen problems where tasks were stuck in cgroup destroy with
> >30,000 attempts. Exposing metrics would allow us to easily detect problems
> like this.
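
To make the HTTP+JSON idea concrete, here is a minimal sketch using only
the Python standard library. The counter names are taken from the list
above; the /metrics.json path, the JSON layout and the helper names are
assumptions for illustration and do not describe an existing Mesos endpoint.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical in-process store for a few of the counters listed above.
COUNTERS = {"cgroup_destroy_attempts": 0, "tasks_launched": 0, "tasks_lost": 0}
LOCK = threading.Lock()


def increment(name, by=1):
    with LOCK:
        COUNTERS[name] = COUNTERS.get(name, 0) + by


def render_metrics():
    """Serialize the current counters as a JSON document."""
    with LOCK:
        return json.dumps({"counters": dict(COUNTERS)}, sort_keys=True)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumed path; anything else is a 404.
        if self.path != "/metrics.json":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Serve on an ephemeral local port, fetch the metrics once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        with urlopen(f"http://127.0.0.1:{port}/metrics.json") as response:
            return json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()
```

A monitoring job could poll such an endpoint and alert when
cgroup_destroy_attempts grows far faster than cgroup destroys complete,
which is the kind of stuck-destroy problem described above.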