This is a really interesting idea. I haven't done much beyond the built-in
SLA system in Airflow for real-time alerting on anomalous behavior, which
might be the most 'native' option available right now. For more
post-mortem detective work, I usually crunch numbers against the metrics
in the metadata database myself. I will take a look at TICK and am also
interested in any other recommendations that come up on this thread.
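
As a rough illustration of the kind of number crunching meant here, below is
a minimal Python sketch. It assumes you have already pulled the durations (in
seconds) of past successful runs of one task from the metadata database, for
example from the duration column of the task_instance table. The function
name, the threshold of 3.5, and the sample numbers are hypothetical choices
for illustration, not anything built into Airflow. It flags runs using a
modified z-score based on the median and the median absolute deviation (MAD),
which is less sensitive to a single extreme run than mean and standard
deviation.

```python
from statistics import median


def flag_anomalous_durations(durations, threshold=3.5):
    """Return indices of runs whose modified z-score exceeds the threshold.

    durations: task run durations in seconds (assumed to come from the
    Airflow metadata database). The name and threshold are illustrative.
    """
    if len(durations) < 3:
        return []  # too little history to judge
    med = median(durations)
    # Median absolute deviation: a robust measure of spread.
    mad = median([abs(d - med) for d in durations])
    if mad == 0:
        return []  # no spread in the history; the score is undefined
    # 0.6745 rescales MAD so the score is comparable to a z-score.
    return [
        i
        for i, d in enumerate(durations)
        if 0.6745 * abs(d - med) / mad > threshold
    ]


if __name__ == "__main__":
    history = [60, 62, 59, 61, 63, 60, 180]  # made-up sample durations
    print(flag_anomalous_durations(history))
```

The same check could be run per task on a schedule, but how you query the
database and where you send alerts depends on your setup.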

On Wed, Nov 15, 2017 at 10:38 AM, Sergei Iakhnin <[email protected]> wrote:

> I use the TICK stack - https://github.com/influxdata/. You can read more in
> our paper - https://www.biorxiv.org/content/early/2017/09/08/185736
> Basically, Telegraf collects metrics (including statsd metrics from Airflow;
> Airflow would benefit from more of these) and sends them to InfluxDB.
> Kapacitor has rules on top for anomaly detection, and Chronograf and Grafana
> handle visualization. If the resolution is automatable (service restarts,
> etc.), I have an agent that uses SaltStack's HTTP API to communicate with a
> configuration management server, which takes action to fix the issue. If the
> issue is not automatable, notifications are sent via email and Slack.
>
>
> On Wed, Nov 15, 2017 at 4:23 PM Andrew Maguire <[email protected]>
> wrote:
>
> > Hi All,
> >
> > Just wondering what some of the best options are to do more advanced
> > alerting and anomaly detection on task metrics within Airflow.
> >
> > Currently we have a job that sends metrics for each task run to Anodot
> > <https://www.anodot.com/> which is a really cool tool.
> >
> > However, as our DAGs tend to have many tasks and I'm sending about 6 or so
> > metrics for each DAG run from the Airflow database, I've blown through the
> > 50k monthly metrics our Anodot licence covers.
> >
> > So just wondering what might be a more native way to do task monitoring in
> > Airflow, if there is one.
> >
> > The main use case here is to catch cases where, even though a job is still
> > running, its behaviour has changed significantly, which may be a sign of
> > something that needs investigation.
> >
> > Cheers,
> > Andy
> >
> --
>
> Sergei
>
