Review: Needs Fixing
Diff comments:
> diff --git a/ols/launchpad.rules b/ols/launchpad.rules
> index 2db3f47..9a404ba 100644
> --- a/ols/launchpad.rules
> +++ b/ols/launchpad.rules
> @@ -23,6 +23,17 @@ groups:
> labels:
> severity: warning
>
> + - alert: LaunchpadPPAPublisherStuck
> + expr: absent_over_time(lp_script_activity_count{env='production',
> name='publish-distro'}[30m])
Please make this
`env='production',service='launchpad-ppa',name='publish-distro'`, to make sure
that we don't catch other publisher instances.
My experience is that 30 minutes is a bit too much of a hair-trigger for this;
the PPA publisher can be very bursty depending on what people are trying to
publish, and is especially sensitive to large packages or large PPAs. For
example, there was a `publish-distro` run yesterday evening that took 2836
seconds (admittedly while PS5 was having some trouble, but it wouldn't have
struck me as a definite indicator of trouble on its own), and we also need to
allow a little slack for gaps between runs since the cron job also includes
`process-accepted`.
Perhaps in future we can add finer-grained metrics so that we can have more
sensitive alerts, but for now, I'd suggest `1h` as an initial value here so
that we don't exhaust IS with false positives. If need be, we have the option
of adding more sensitive Grafana alerts that go to the ~launchpad-alerts
channel (though I don't think that's particularly necessary here).
> + for: 5m
> + annotations:
> + summary: ppa publisher has not run for 30m
For human-readable text, make this "PPA publisher ...".
> + dashboard_url:
> https://grafana.admin.canonical.com/d/000000044/telegraf-host?orgId=1&from=now-2h&to=now&var-juju_controller=prodstack5-prodstack5-prodstack-is&var-juju_model=All&var-service=launchpad-ppa&var-juju_unit=launchpad-ppa%2F2
Drop `&var-juju_unit=launchpad-ppa%2F2` - it doesn't add any extra selectivity
since that Juju application only has a single unit, and we generally want to
avoid specifying individual units where possible.
> + description: Launchpad Script {{ $labels.name }} is not running as
> expected.
> + playbook_url:
> https://wiki.canonical.com/InformationInfrastructure/IS/LaunchpadScripts#LaunchpadPPAPublisherStuck
> + labels:
> + severity: warning
> +
> - alert: LaunchpadFlagExpiredMembershipsStuck
> expr:
> absent_over_time(lp_script_activity_count{env='production',host='loganberry',name='flag-expired-memberships'}[24h])
> for: 1h
--
https://code.launchpad.net/~ilasc/canonical-is-prometheus/+git/canonical-is-prometheus/+merge/433207
Your team Launchpad code reviewers is subscribed to branch
canonical-is-prometheus:master.
_______________________________________________
Mailing list: https://launchpad.net/~launchpad-reviewers
Post to : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-reviewers
More help : https://help.launchpad.net/ListHelp