[
https://issues.apache.org/jira/browse/AURORA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155130#comment-14155130
]
Isaac Councill commented on AURORA-634:
---------------------------------------
+1 for the usefulness of this documentation. Got this from Bill Farner on the dev list:
There's a wealth of instrumentation exposed at /vars on the scheduler. To
rattle off a few that are a good fit for monitoring:
task_store_LOST
If this value is increasing at a high rate, it's a sign of trouble. Note:
this one is not monotonically increasing; it will decrease when old
terminated tasks are GCed.
scheduler_resource_offers
Must be increasing; the rate will depend on cluster size and the behavior
of other frameworks.
jvm_uptime_secs
Detecting resets on this value will tell you that the scheduler is failing
to stay alive.
framework_registered
If no schedulers have a '1' on this, then Aurora is not registered with
Mesos.
rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
This gives you a moving window of log append latency. A hike in this value
suggests disk IOPS contention.
timed_out_tasks
An increase in this value indicates that Aurora is moving tasks into
transient states (e.g. ASSIGNED, KILLING), but not hearing back from Mesos
promptly.
system_load_avg
A high sustained value here suggests that the machine may be over-utilized.
http_500_responses_events
An increase here indicates internal server errors responding to RPCs and
web UI loading.
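
To make the checks above concrete, here is a minimal Python sketch that
compares two successive samples of /vars from one scheduler. It assumes
the endpoint returns plain text with one "name value" pair per line; the
function names and alert strings are hypothetical, and the choice to check
resource offers only on a scheduler reporting framework_registered = 1 is
an assumption, not something stated above. task_store_LOST and
system_load_avg are left out because they need site-specific thresholds.

```python
from typing import Dict, Iterable, List, Optional


def parse_vars(text: str) -> Dict[str, float]:
    """Parse 'name value' lines (assumed /vars format), keeping numeric values."""
    stats: Dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            stats[parts[0]] = float(parts[1])
        except ValueError:
            pass  # skip non-numeric variables such as version strings
    return stats


def delta(prev: Dict[str, float], curr: Dict[str, float], name: str) -> float:
    """Change in a variable between two samples; missing values count as 0."""
    return curr.get(name, 0.0) - prev.get(name, 0.0)


def append_latency_nanos(prev: Dict[str, float],
                         curr: Dict[str, float]) -> Optional[float]:
    """Mean log append latency over the window, or None if nothing was appended."""
    events = delta(prev, curr, "scheduler_log_native_append_events")
    if events <= 0:
        return None
    nanos = delta(prev, curr, "scheduler_log_native_append_nanos_total")
    return nanos / events


def check_scheduler(prev: Dict[str, float], curr: Dict[str, float]) -> List[str]:
    """Return alert strings for one scheduler, given two successive samples."""
    if delta(prev, curr, "jvm_uptime_secs") < 0:
        # Counters restart with the process, so other deltas are meaningless here.
        return ["jvm_uptime_secs reset: scheduler restarted"]
    alerts: List[str] = []
    # Assumption: only a scheduler registered with Mesos is expected to see offers.
    if (curr.get("framework_registered") == 1
            and delta(prev, curr, "scheduler_resource_offers") <= 0):
        alerts.append("scheduler_resource_offers is not increasing")
    if delta(prev, curr, "timed_out_tasks") > 0:
        alerts.append("timed_out_tasks increased")
    if delta(prev, curr, "http_500_responses_events") > 0:
        alerts.append("http_500_responses_events increased")
    return alerts


def aurora_registered(samples: Iterable[Dict[str, float]]) -> bool:
    """True if at least one scheduler reports framework_registered = 1."""
    return any(s.get("framework_registered") == 1 for s in samples)
```

In practice the text passed to parse_vars would be fetched from each
scheduler's /vars page on a fixed interval, and the latency from
append_latency_nanos compared against a baseline for the cluster.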
> Add a monitoring guide
> ----------------------
>
> Key: AURORA-634
> URL: https://issues.apache.org/jira/browse/AURORA-634
> Project: Aurora
> Issue Type: Story
> Components: Documentation
> Reporter: Bill Farner
>
> Aurora provides a wealth of undocumented telemetry that is useful in
> monitoring a cluster. Add documentation about some of the recommended
> variables to use for monitoring and alerting.