Jonathan Boulle created AURORA-178:
--------------------------------------
Summary: Log/observe snapshot operations
Key: AURORA-178
URL: https://issues.apache.org/jira/browse/AURORA-178
Project: Aurora
Issue Type: Task
Components: Scheduler
Reporter: Jonathan Boulle
Priority: Minor
Currently, snapshot operations that run for an excessive duration aren't
necessarily obvious in e.g. the scheduler logs or dashboards. Since this is a
potentially critical/dangerous operation (in some cases leading to ZooKeeper
timeouts + scheduler suicide), it would be prudent to expose relevant
information more readily (e.g. when the operations commence/complete, timing, etc.).
From Zameer:
{quote}The doSnapshot method of LogStorage is timed with the key
"scheduler_log_snapshot". These are the stats it produces:
scheduler_log_snapshot_events 19
scheduler_log_snapshot_events_per_sec 0.0
scheduler_log_snapshot_nanos_per_event 0.0
scheduler_log_snapshot_nanos_total 373115257383
scheduler_log_snapshot_nanos_total_per_sec 0.0
scheduler_log_snapshot_persist_events 19
scheduler_log_snapshot_persist_events_per_sec 0.0
scheduler_log_snapshot_persist_nanos_per_event 0.0
scheduler_log_snapshot_persist_nanos_total 339151517713
scheduler_log_snapshot_persist_nanos_total_per_sec 0.0
scheduler_log_snapshots 19
Which metric should be tracked in our dashboard?
{quote}
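For reference, the cumulative counters quoted above can be turned into a mean duration per snapshot by dividing `*_nanos_total` by `*_events`. The sketch below does only that arithmetic; `SnapshotStats` and `meanSeconds` are hypothetical names used for illustration and are not part of Aurora. With the sample values it works out to roughly 19.6 s per snapshot, of which roughly 17.9 s is the persist step.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Aurora.
final class SnapshotStats {
  private SnapshotStats() {}

  /** Mean duration per event in seconds, or 0.0 when no events were recorded. */
  static double meanSeconds(long nanosTotal, long events) {
    if (events <= 0) {
      return 0.0;
    }
    return (double) nanosTotal / events / TimeUnit.SECONDS.toNanos(1);
  }

  public static void main(String[] args) {
    // Values copied from the stats listed in the comment above.
    System.out.println(String.format(Locale.ROOT,
        "snapshot mean: %.2f s", meanSeconds(373115257383L, 19)));
    System.out.println(String.format(Locale.ROOT,
        "persist mean:  %.2f s", meanSeconds(339151517713L, 19)));
  }
}
```

A mean hides outliers, so this is only a rough guide when choosing what to put on a dashboard.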
From Bill F:
{quote}a very long snapshot might never be reflected there if a suicide happens
mid-way through. The minimal fix would be to just LOG when a snapshot is about
to commence.{quote}
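A minimal sketch of that suggestion, assuming the snapshot can be passed in as a `Runnable`: log before the work starts, so the start line survives even if the process dies mid-way, then log the outcome and elapsed time afterwards. `LoggedSnapshot` and `run` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained; the actual `LogStorage.doSnapshot` code is not shown here and may be structured differently.

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical wrapper for illustration only; not Aurora's LogStorage.
final class LoggedSnapshot {
  private static final Logger LOG = Logger.getLogger(LoggedSnapshot.class.getName());

  private LoggedSnapshot() {}

  private static long elapsedMillis(long startNanos) {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
  }

  /**
   * Runs the given snapshot action, logging once before it starts and once
   * when it finishes or throws. Returns the elapsed time in milliseconds.
   */
  static long run(Runnable snapshotAction) {
    // Logged up front so a snapshot that never finishes still leaves a trace.
    LOG.info("Snapshot starting.");
    long startNanos = System.nanoTime();
    String outcome = "failed";
    try {
      snapshotAction.run();
      outcome = "completed";
      return elapsedMillis(startNanos);
    } finally {
      LOG.info(String.format("Snapshot %s after %d ms.", outcome, elapsedMillis(startNanos)));
    }
  }
}
```

A caller would wrap the existing snapshot call, e.g. `LoggedSnapshot.run(() -> doTheSnapshot())`, where `doTheSnapshot` stands in for whatever performs the snapshot. A start line in the log with no matching completion line then points at a snapshot that was interrupted.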