Re: Improving Airflow SLAs

Ananth Durai Wed, 02 May 2018 23:27:07 -0700

Since we are talking about the SLA implementation, The current SLA miss
implementation is part of the scheduler code. So in the cases like
scheduler max out the process / not running for some reason, we will miss
all the SLA alert. It is worth to decouple SLA alert from the scheduler
path and run as a separate process.



Regards,
Ananth.P,






On 2 May 2018 at 20:31, David Capwell <[email protected]> wrote:

> We use SLA as well and works great for some DAGs and painful for others
>
> We rely on sensors to validate the data is ready before we run and each dag
> waits on sensors for different times (one dag waits for 8 hours since it
> expects date at the start of day but tends to get it 8 hours later).  We
> also have some nested dags that have about 10 tasks deep.
>
> In these two cases SLA warnings come very late since the semantics we see
> is DAG completion time; what we really want is what you were talking about,
> expected execution times
>
> Also SLA trigger on backfills and manual reruns of tasks
>
> I see this as a critical feature for production monitoring so would love to
> see this get improved
>
> On Wed, May 2, 2018, 12:00 PM James Meickle <[email protected]>
> wrote:
>
> > At Quantopian we use Airflow to produce artifacts based on the previous
> > day's stock market data. These artifacts are required for us to trade on
> > today's stock market. Therefore, I've been investing time in improving
> > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > My attention has turned to Airflow's SLA system, which has some drawbacks
> > for our use case:
> >
> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > skipped for this execution date will still trigger emails/callbacks. This
> > is a huge problem for us because we run almost no tasks on weekends
> (since
> > the stock market isn't open).
> >
> > 2) Defining SLAs can be awkward because they are relative to the
> execution
> > date instead of the task start time. There's no way to alert if a task
> runs
> > for "more than an hour", for any non-trivial DAG. Instead you can only
> > express "more than an hour from execution date".  The financial data we
> use
> > varies in when it arrives, and how long it takes to process (data volume
> > changes frequently); we also have tight timelines that make retries
> > difficult, so we want to alert an operator while leaving the task
> running,
> > rather than failing and then alerting.
> >
> > 3) SLA miss emails don't have a subject line containing the instance URL
> > (important for us because we run the same DAGs in both
> staging/production)
> > or the execution date they apply to. When opened, they can get hard to
> read
> > for even a moderately sized DAG because they include a flat list of task
> > instances that are unsorted (neither alpha nor topo). They are also
> lacking
> > any links back to the Airflow instance.
> >
> > 4) SLA emails are not callbacks, and can't be turned off (other than
> either
> > removing the SLA or removing the email attribute on the task instance).
> The
> > way that SLA miss callbacks are defined is not intuitive, as in contrast
> to
> > all other callbacks, they are DAG-level rather than task-level. Also, the
> > call signature is poorly defined: for instance, two of the arguments are
> > just strings produced from the other two arguments.
> >
> > I have some thoughts about ways to fix these issues:
> >
> > 1) I just consider this one a bug. If a task instance is skipped, that
> was
> > intentional, and it should not trigger any alerts.
> >
> > 2) I think that the `sla=` parameter should be split into something like
> > this:
> >
> > `expected_start`: Timedelta after execution date, representing when this
> > task must have started by.
> > `expected_finish`: Timedelta after execution date, representing when this
> > task must have finished by.
> > `expected_duration`: Timedelta after task start, representing how long it
> > is expected to run including all retries.
> >
> > This would give better operator control over SLAs, particularly for tasks
> > deeper in larger DAGs where exact ordering may be hard to predict.
> >
> > 3) The emails should be improved to be more operator-friendly, and take
> > into account that someone may get a callback for a DAG they don't know
> very
> > well, or be paged by this notification.
> >
> > 4.1) All Airflow callbacks should support a list, rather than requiring a
> > single function. (I've written a wrapper that does this, but it would be
> > better for Airflow to just handle this itself.)
> >
> > 4.2) SLA miss callbacks should be task callbacks that receive context,
> like
> > all the other callbacks. Having a DAG figure out which tasks have missed
> > SLAs collectively is fine, but getting SLA failures in a batched callback
> > doesn't really make much sense. Per-task callbacks can be fired
> > individually within a batch of failures detected at the same time.
> >
> > 4.3) SLA emails should be the default SLA miss callback function, rather
> > than being hardcoded.
> >
> > Also, overall, the SLA miss logic is very complicated. It's stuffed into
> > one overloaded function that is responsible for checking for SLA misses,
> > creating database objects for them, filtering tasks, selecting emails,
> > rendering, and sending. Refactoring it would be a good maintainability
> win.
> >
> > I am already implementing some of the above in a private branch, but I'd
> be
> > curious to hear community feedback as to which of these suggestions might
> > be desirable upstream. I could have this ready for Airflow 2.0 if there
> is
> > interest beyond my own use case.
> >
>

Re: Improving Airflow SLAs

Reply via email to