Repository: aurora
Updated Branches:
  refs/heads/master 165c4dff7 -> 638471036


Fix header levels in monitoring.md

Reviewed at https://reviews.apache.org/r/32830/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/63847103
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/63847103
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/63847103

Branch: refs/heads/master
Commit: 63847103606b8ef29dd0901fbc2dd54bf03cc185
Parents: 165c4df
Author: Stephan Erb <[email protected]>
Authored: Wed Apr 8 21:51:58 2015 -0700
Committer: Bill Farner <[email protected]>
Committed: Wed Apr 8 21:51:58 2015 -0700

----------------------------------------------------------------------
 docs/monitoring.md | 47 +++++++++++------------------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/63847103/docs/monitoring.md
----------------------------------------------------------------------
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 8aee669..3cb2a79 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -74,133 +74,108 @@ recommend you start with a strict value after viewing a small amount of collecte
 adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
 and thresholds make sense.
 
-#### `jvm_uptime_secs`
+## Important stats
+
+### `jvm_uptime_secs`
 Type: integer counter
 
-#### Description
 The number of seconds the JVM process has been running. Comes from
 [RuntimeMXBean#getUptime()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime\(\))
 
-#### Alerting
 Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
 stay alive.
 
-#### Triage
 Look at the scheduler logs to identify the reason the scheduler is exiting.
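
As a rough illustration of the reset check described above, here is a minimal Python sketch that polls the stat and flags decreases. It assumes the value is read from the scheduler's `/vars.json` endpoint; the host, port, and 60-second polling interval are placeholders, not anything prescribed by this doc.

```python
# Minimal sketch: flag jvm_uptime_secs resets (i.e. scheduler restarts).
# Assumes the stat is served as JSON from the scheduler's /vars.json endpoint;
# the URL and polling interval below are placeholders.
import json
import time
import urllib.request

VARS_URL = "http://scheduler.example.com:8081/vars.json"  # placeholder address

def jvm_uptime_secs() -> int:
    with urllib.request.urlopen(VARS_URL) as resp:
        return json.load(resp)["jvm_uptime_secs"]

previous = jvm_uptime_secs()
while True:
    time.sleep(60)
    current = jvm_uptime_secs()
    if current < previous:
        print("ALERT: jvm_uptime_secs reset -- the scheduler process restarted")
    previous = current
```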
 
-#### `system_load_avg`
+### `system_load_avg`
 Type: double gauge
 
-#### Description
 The current load average of the system for the last minute. Comes from
 [OperatingSystemMXBean#getSystemLoadAverage()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage\(\)).
 
-#### Alerting
 A high sustained value suggests that the scheduler machine may be over-utilized.
 
-#### Triage
 Use standard unix tools like `top` and `ps` to track down the offending process(es).
 
-#### `process_cpu_cores_utilized`
+### `process_cpu_cores_utilized`
 Type: double gauge
 
-#### Description
 The current number of CPU cores in use by the JVM process. This should not exceed the number of
 logical CPU cores on the machine. Derived from
 [OperatingSystemMXBean#getProcessCpuTime()](http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html)
 
-#### Alerting
 A high sustained value indicates that the scheduler is overworked. Due to current internal design
 limitations, if this value is sustained at `1`, there is a good chance the scheduler is under water.
 
-#### Triage
 There are two main inputs that tend to drive this figure: task scheduling attempts and status
 updates from Mesos.  You may see activity in the scheduler logs to give an indication of where
 time is being spent.  Beyond that, it really takes good familiarity with the code to effectively
 triage this.  We suggest engaging with an Aurora developer.
 
-#### `task_store_LOST`
+### `task_store_LOST`
 Type: integer gauge
 
-#### Description
 The number of tasks stored in the scheduler that are in the `LOST` state, and have been rescheduled.
 
-#### Alerting
 If this value is increasing at a high rate, it is a sign of trouble.
 
-#### Triage
 There are many sources of `LOST` tasks in Mesos: the scheduler, master, slave, and executor can all
 trigger this.  The first step is to look in the scheduler logs for `LOST` to identify where the
 state changes are originating.
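
If it helps to make "increasing at a high rate" concrete, the sketch below compares two samples of the gauge taken a known interval apart. The sample values and the five-per-minute threshold are invented for illustration; pick a threshold against your own baseline.

```python
# Illustrative check: alert when task_store_LOST grows faster than a chosen
# threshold. Sample values and threshold are made up for the example.
def lost_tasks_per_minute(earlier: int, later: int, interval_secs: float) -> float:
    """Rate of increase of task_store_LOST between two samples."""
    return (later - earlier) / (interval_secs / 60.0)

rate = lost_tasks_per_minute(earlier=40, later=90, interval_secs=300)
if rate > 5:  # threshold is illustrative
    print(f"ALERT: task_store_LOST rising at {rate:.1f}/min; grep scheduler logs for LOST")
```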
 
-#### `scheduler_resource_offers`
+### `scheduler_resource_offers`
 Type: integer counter
 
-#### Description
 The number of resource offers that the scheduler has received.
 
-#### Alerting
 For a healthy scheduler, this value must be increasing over time.
 
-##### Triage
 Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it
 is sending offers. You should also look at the master's web interface to see if it has a large
 number of outstanding offers that it is waiting to be returned.
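
One way to express the "must be increasing" alert, as a sketch: compare two samples of the counter taken a few minutes apart and alert if the later one is not strictly larger. How the samples are collected (for example by scraping the vars endpoint) is left to your monitoring stack; the numbers below are illustrative.

```python
# Sketch of the "offers must keep flowing" alert: two samples of
# scheduler_resource_offers taken a few minutes apart. Values are illustrative.
def offers_received(earlier: int, later: int) -> bool:
    """True if at least one resource offer arrived between the two samples."""
    return later > earlier

if not offers_received(earlier=123456, later=123456):
    print("ALERT: scheduler_resource_offers has not increased; check the master")
```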
 
-#### `framework_registered`
+### `framework_registered`
 Type: binary integer counter
 
-#### Description
 Will be `1` for the leading scheduler that is registered with the Mesos master, `0` for passive
 schedulers.
 
-#### Alerting
 A sustained period without a `1` (or where `sum() != 1`) warrants investigation.
 
-#### Triage
 If there is no leading scheduler, look in the scheduler and master logs for why.  If there are
 multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
 bug.
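
A small sketch of the `sum() != 1` check above, assuming you can read `framework_registered` from each scheduler in the ensemble; the host names and values here are made up.

```python
# Sketch: exactly one scheduler should report framework_registered == 1.
# The per-host values below are made-up samples.
samples = {"scheduler-a": 0, "scheduler-b": 1, "scheduler-c": 0}

registered = sum(samples.values())
if registered == 0:
    print("ALERT: no scheduler is registered with the Mesos master")
elif registered > 1:
    print("CRITICAL: multiple schedulers registered -- possible split brain")
```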
 
-#### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
+### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
 Type: rate ratio of integer counters
 
-#### Description
 This composes two counters to compute a windowed figure for the latency of replicated log writes.
 
-#### Alerting
 A hike in this value suggests disk bandwidth contention.
 
-#### Triage
 Look in scheduler logs for any reported oddness with saving to the replicated log. Also use
 standard tools like `vmstat` and `iotop` to identify whether the disk has become slow or
 over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
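
To make the rate-ratio arithmetic concrete, here is a minimal sketch that turns two samples of each counter, taken over the same window, into an average append latency. The sample numbers are invented for illustration.

```python
# Sketch: windowed replicated-log append latency from two counter samples.
# rate(nanos_total) / rate(events) == average nanoseconds per append event.
def append_latency_ms(nanos_then: int, nanos_now: int,
                      events_then: int, events_now: int) -> float:
    events = events_now - events_then
    if events == 0:
        return 0.0
    return (nanos_now - nanos_then) / events / 1e6  # ns -> ms

# Invented samples taken one minute apart: 150 appends costing 750 ms in total.
print(append_latency_ms(nanos_then=1_200_000_000, nanos_now=1_950_000_000,
                        events_then=300, events_now=450))  # -> 5.0
```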
 
-#### `timed_out_tasks`
+### `timed_out_tasks`
 Type: integer counter
 
-#### Description
 Tracks the number of times the scheduler has given up while waiting
 (for `-transient_task_state_timeout`) to hear back about a task that is in a transient state
 (e.g. `ASSIGNED`, `KILLING`), and has moved to `LOST` before rescheduling.
 
-#### Alerting
 This value is currently known to increase occasionally when the scheduler fails over
 ([AURORA-740](https://issues.apache.org/jira/browse/AURORA-740)). However, any large spike in this
 value warrants investigation.
 
-#### Triage
 The scheduler will log when it times out a task. You should trace the task ID of the timed out
 task into the master, slave, and/or executors to determine where the message was dropped.
 
-#### `http_500_responses_events`
+### `http_500_responses_events`
 Type: integer counter
 
-#### Description
 The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.
 
-#### Alerting
 An increase warrants investigation.
 
-#### Triage
 Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.
