Improving Airflow SLAs

James Meickle Wed, 02 May 2018 12:01:00 -0700

At Quantopian we use Airflow to produce artifacts based on the previous
day's stock market data. These artifacts are required for us to trade on
today's stock market. Therefore, I've been investing time in improving
Airflow notifications (such as writing PagerDuty and Slack integrations).
My attention has turned to Airflow's SLA system, which has some drawbacks
for our use case:


1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
skipped for this execution date will still trigger emails/callbacks. This
is a huge problem for us because we run almost no tasks on weekends (since
the stock market isn't open).

2) Defining SLAs can be awkward because they are relative to the execution
date instead of the task start time. In any non-trivial DAG there is no way
to alert when a task runs for "more than an hour"; you can only express
"more than an hour from execution date". The financial data we use varies
in when it arrives and in how long it takes to process (data volume changes
frequently). We also have tight timelines that make retries difficult, so
we want to alert an operator while leaving the task running, rather than
failing and then alerting.

3) SLA miss emails don't have a subject line containing the instance URL
(important for us because we run the same DAGs in both staging/production)
or the execution date they apply to. When opened, they can be hard to read
for even a moderately sized DAG because they include a flat list of task
instances that is unsorted (neither alphabetically nor topologically). They
also lack any links back to the Airflow instance.

4) SLA emails are not callbacks, and can't be turned off (other than by
either removing the SLA or removing the email attribute on the task
instance). The way that SLA miss callbacks are defined is not intuitive: in
contrast to all other callbacks, they are DAG-level rather than task-level.
Also, the call signature is poorly defined: for instance, two of the
arguments are just strings produced from the other two arguments.

I have some thoughts about ways to fix these issues:

1) I just consider this one a bug. If a task instance is skipped, that was
intentional, and it should not trigger any alerts.
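
As a rough sketch of what "skip-aware" could mean, the SLA check would drop
skipped task instances before evaluating anything. `TaskRun` and
`sla_candidates` are hypothetical names for illustration, not Airflow's
actual models or functions:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class TaskRun:
    """Hypothetical stand-in for a task instance (not Airflow's model)."""
    task_id: str
    state: Optional[str]       # e.g. "success", "running", "skipped", or None
    sla: Optional[timedelta]   # None means no SLA is defined for the task


def sla_candidates(runs: List[TaskRun]) -> List[TaskRun]:
    """Keep only runs that have an SLA and were not intentionally skipped."""
    return [r for r in runs if r.sla is not None and r.state != "skipped"]
```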

2) I think that the `sla=` parameter should be split into something like
this:

`expected_start`: Timedelta after execution date, representing when this
task must have started by.

`expected_finish`: Timedelta after execution date, representing when this
task must have finished by.

`expected_duration`: Timedelta after task start, representing how long it
is expected to run including all retries.

This would give better operator control over SLAs, particularly for tasks
deeper in larger DAGs where exact ordering may be hard to predict.
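
To make the three parameters concrete, here is a minimal sketch of how they
might be evaluated. `SlaSpec` and `sla_misses` are hypothetical names, and
this is only one possible reading of the semantics described above, not an
existing Airflow API. Note that the duration check can fire while the task
is still running, which is the case I care most about:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class SlaSpec:
    """Hypothetical container for the three proposed parameters."""
    expected_start: Optional[timedelta] = None
    expected_finish: Optional[timedelta] = None
    expected_duration: Optional[timedelta] = None


def sla_misses(spec: SlaSpec, execution_date: datetime, now: datetime,
               start: Optional[datetime] = None,
               end: Optional[datetime] = None) -> List[str]:
    """Return the names of the expectations that have been missed as of `now`."""
    misses = []
    # A task that has not started (or finished) yet is measured against `now`.
    started_at = start if start is not None else now
    finished_at = end if end is not None else now

    if spec.expected_start is not None:
        if started_at > execution_date + spec.expected_start:
            misses.append("expected_start")
    if spec.expected_finish is not None:
        if finished_at > execution_date + spec.expected_finish:
            misses.append("expected_finish")
    # Duration is relative to the task start, so it only applies once started.
    if spec.expected_duration is not None and start is not None:
        if finished_at - start > spec.expected_duration:
            misses.append("expected_duration")
    return misses
```

For example, a task with a one-hour `expected_duration` that started 90
minutes ago and is still running would report only `expected_duration`, even
if its start and finish deadlines have not yet passed.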

3) The emails should be improved to be more operator-friendly, and take
into account that someone may get a callback for a DAG they don't know very
well, or be paged by this notification.

4.1) All Airflow callbacks should support a list, rather than requiring a
single function. (I've written a wrapper that does this, but it would be
better for Airflow to just handle this itself.)
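
A wrapper along these lines is enough to get list behavior today. This is
an illustrative sketch rather than my exact code; `combine_callbacks` is a
hypothetical name, and it assumes the usual task callback shape of a
function taking a single `context` argument:

```python
import logging
from typing import Any, Callable, Dict, Iterable

Callback = Callable[[Dict[str, Any]], None]


def combine_callbacks(callbacks: Iterable[Callback]) -> Callback:
    """Wrap several callbacks into one function that calls each in order."""
    callbacks = list(callbacks)

    def combined(context: Dict[str, Any]) -> None:
        for callback in callbacks:
            try:
                callback(context)
            except Exception:
                # One failing notifier should not prevent the others from running.
                logging.exception("Callback %r failed", callback)

    return combined
```

It would be used as something like
`on_failure_callback=combine_callbacks([notify_slack, notify_pagerduty])`,
where both notifier names are placeholders.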

4.2) SLA miss callbacks should be task callbacks that receive context, like
all the other callbacks. Having a DAG figure out which tasks have missed
SLAs collectively is fine, but getting SLA failures in a batched callback
doesn't really make much sense. Per-task callbacks can be fired
individually within a batch of failures detected at the same time.

4.3) SLA emails should be the default SLA miss callback function, rather
than being hardcoded.
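
Taken together, 4.2 and 4.3 could look roughly like the sketch below: misses
are still detected as a batch, but each one is delivered to its own task's
callback, falling back to a default. All names here are hypothetical, and
the default simply prints where the real one would send the email:

```python
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]
SlaCallback = Callable[[Context], None]


def default_sla_miss_callback(context: Context) -> None:
    """Placeholder for the email notification; prints instead of sending."""
    print(f"SLA missed: {context['task_id']} for {context['execution_date']}")


def dispatch_sla_misses(misses: List[Context],
                        callbacks: Dict[str, Optional[SlaCallback]],
                        default: SlaCallback = default_sla_miss_callback) -> int:
    """Fire one callback per missed SLA and return how many were fired."""
    fired = 0
    for context in misses:
        # Tasks without an entry use the default; an explicit None disables it.
        callback = callbacks.get(context["task_id"], default)
        if callback is None:
            continue
        callback(context)
        fired += 1
    return fired
```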

Also, overall, the SLA miss logic is very complicated. It's stuffed into
one overloaded function that is responsible for checking for SLA misses,
creating database objects for them, filtering tasks, selecting emails,
rendering, and sending. Refactoring it would be a good maintainability win.

I am already implementing some of the above in a private branch, but I'd be
curious to hear community feedback as to which of these suggestions might
be desirable upstream. I could have this ready for Airflow 2.0 if there is
interest beyond my own use case.
