This is a really interesting idea. For real-time alerting on anomalous behavior I haven't done much beyond the built-in SLA system in Airflow, which might be the most 'native' option available right now. For more post-mortem detective work I usually crunch the numbers myself against the metrics in the metadata database. I will take a look at TICK, and I'm also interested in any other recommendations that come up on this thread.
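As a rough illustration of the kind of number crunching I mean, here is a minimal sketch that flags unusual task durations with a modified z-score (median and MAD). It is not an Airflow feature: the function name, the 3.5 threshold, and the sample durations are all assumptions for illustration, and in practice the durations would have to be pulled from the metadata database first (for example, the duration column of the task_instance table).

```python
from statistics import median


def flag_anomalous_durations(durations, threshold=3.5):
    """Return indices of durations whose modified z-score exceeds threshold.

    Hypothetical helper, not part of Airflow. `durations` is a list of task
    run times in seconds for a single task, oldest first.
    """
    if len(durations) < 3:
        return []  # too little history to say anything useful

    med = median(durations)
    # Median absolute deviation: a spread estimate robust to outliers.
    mad = median([abs(d - med) for d in durations])
    if mad == 0:
        return []  # all runs (nearly) identical; score is undefined

    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [
        i
        for i, d in enumerate(durations)
        if abs(0.6745 * (d - med) / mad) > threshold
    ]


if __name__ == "__main__":
    # Made-up sample: five ordinary runs and one much slower run.
    history = [60, 62, 58, 61, 59, 300]
    print(flag_anomalous_durations(history))
```

Median and MAD are used here rather than mean and standard deviation so that a single very slow run does not inflate the baseline it is being compared against.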

On Wed, Nov 15, 2017 at 10:38 AM, Sergei Iakhnin <[email protected]> wrote:

> I use the TICK stack - https://github.com/influxdata/. You can read more in
> our paper - https://www.biorxiv.org/content/early/2017/09/08/185736
> Basically Telegraf collects metrics (including statsd metrics from Airflow;
> Airflow would benefit from more of these), sends them to Influxdb,
> Kapacitor has rules on top for anomaly detection, Chronograf and Grafana
> for visualization. If the resolution is automatable (service restarts,
> etc.) I have an agent that uses Saltstack's HTTP API to communicate with a
> configuration management server which takes action to fix the issue. If the
> issue is not automatable then send notifications via email and Slack.
>
> On Wed, Nov 15, 2017 at 4:23 PM Andrew Maguire <[email protected]> wrote:
>
> > Hi All,
> >
> > Just wondering what some of the best options are to do more advance
> > alerting and anomaly detection on task metrics within airflow.
> >
> > Currently we have a job that sends metrics for each task run to Anodot
> > <https://www.anodot.com/> which is a really cool tool.
> >
> > However as our dags tend to have many tasks and i'm sending about 6 or so
> > metrics for each dag run from the airflow database, i've blown through the
> > 50k monthly metrics our Anodot licence covers.
> >
> > So just wondering what might be a more native way to do task monitoring in
> > Airflow if there is one.
> >
> > Main use case here is to catch cases where even though a job is still
> > running its behaviour has changed significantly which may be a sign of
> > something that needs investigation.
> >
> > Cheers,
> > Andy
>
> --
> Sergei
