Review: Needs Fixing


Diff comments:

> diff --git a/ols/launchpad.rules b/ols/launchpad.rules
> index 2db3f47..9a404ba 100644
> --- a/ols/launchpad.rules
> +++ b/ols/launchpad.rules
> @@ -23,6 +23,17 @@ groups:
>          labels:
>            severity: warning
>  
> +      - alert: LaunchpadPPAPublisherStuck
> +        expr: absent_over_time(lp_script_activity_count{env='production', 
> name='publish-distro'}[30m])

Please make this 
`env='production',service='launchpad-ppa',name='publish-distro'`, to make sure 
that we don't catch other publisher instances.

My experience is that 30 minutes is a bit too much of a hair-trigger for this; 
the PPA publisher can be very bursty depending on what people are trying to 
publish, and is especially sensitive to large packages or large PPAs.  For 
example, there was a `publish-distro` run yesterday evening that took 2836 
seconds (admittedly while PS5 was having some trouble, but it wouldn't have 
struck me as a definite indicator of trouble on its own), and we also need to 
allow a little slack for gaps between runs since the cron job also includes 
`process-accepted`.

Perhaps in future we can add finer-grained metrics so that we can have more 
sensitive alerts, but for now, I'd suggest `1h` as an initial value here so 
that we don't exhaust IS with false positives.  If need be, we have the option 
of adding more sensitive Grafana alerts that go to the ~launchpad-alerts 
channel (though I don't think that's particularly necessary here).

> +        for: 5m
> +        annotations:
> +          summary: ppa publisher has not run for 30m

For human-readable text, make this "PPA publisher ...".

> +          dashboard_url: 
> https://grafana.admin.canonical.com/d/000000044/telegraf-host?orgId=1&from=now-2h&to=now&var-juju_controller=prodstack5-prodstack5-prodstack-is&var-juju_model=All&var-service=launchpad-ppa&var-juju_unit=launchpad-ppa%2F2

Drop `&var-juju_unit=launchpad-ppa%2F2` - it doesn't add any extra selectivity 
since that Juju application only has a single unit, and we generally want to 
avoid specifying individual units where possible.

> +          description: Launchpad Script {{ $labels.name }} is not running as 
> expected.
> +          playbook_url: 
> https://wiki.canonical.com/InformationInfrastructure/IS/LaunchpadScripts#LaunchpadPPAPublisherStuck
> +        labels:
> +          severity: warning
> +
>        - alert: LaunchpadFlagExpiredMembershipsStuck
>          expr: 
> absent_over_time(lp_script_activity_count{env='production',host='loganberry',name='flag-expired-memberships'}[24h])
>          for: 1h


-- 
https://code.launchpad.net/~ilasc/canonical-is-prometheus/+git/canonical-is-prometheus/+merge/433207
Your team Launchpad code reviewers is subscribed to branch 
canonical-is-prometheus:master.


_______________________________________________
Mailing list: https://launchpad.net/~launchpad-reviewers
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-reviewers
More help   : https://help.launchpad.net/ListHelp

Reply via email to