Re: Improving Airflow SLAs

James Meickle Thu, 03 May 2018 06:49:04 -0700

That's a very interesting thought, and I have definitely been bitten
multiple times by bugs related to how closely tied SLAs are to the
scheduler:
https://issues.apache.org/jira/browse/AIRFLOW-2178?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20smtp


However, I'm not convinced that adding a new process for a monitoring
service would actually be much better architecturally than just improving
the scheduler codebase. To have high confidence you'd likely want an
external, non-Airflow "is this running" check anyway (for example, we
alert if there are no "heartbeating scheduler" log lines).
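
To make the external check concrete, here is a minimal sketch of a log-based
liveness test. It is not our actual alerting code: the log line format
("YYYY-MM-DD HH:MM:SS <message>"), the function names, and the five-minute
threshold are all assumptions for illustration, and it only needs the
standard library so it can run outside Airflow.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) log format: "YYYY-MM-DD HH:MM:SS <message>".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def last_heartbeat(lines, marker="heartbeat"):
    """Return the timestamp of the newest line containing `marker`, or None."""
    newest = None
    for line in lines:
        if marker not in line.lower():
            continue
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a parseable leading timestamp
        if newest is None or stamp > newest:
            newest = stamp
    return newest


def scheduler_looks_dead(lines, now, max_silence=timedelta(minutes=5)):
    """True if no heartbeat line was logged within `max_silence` of `now`."""
    stamp = last_heartbeat(lines)
    return stamp is None or now - stamp > max_silence
```

Run from cron or any monitoring agent, a check like this keeps working even
when the scheduler itself is wedged, which is the point of keeping it outside
Airflow.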



On Thu, May 3, 2018 at 2:26 AM, Ananth Durai <[email protected]> wrote:

> Since we are talking about the SLA implementation: the current SLA miss
> implementation is part of the scheduler code. So in cases where the
> scheduler maxes out its processes or is not running for some reason, we
> will miss all the SLA alerts. It is worth decoupling SLA alerting from the
> scheduler path and running it as a separate process.
>
>
> Regards,
> Ananth.P,
>
> On 2 May 2018 at 20:31, David Capwell <[email protected]> wrote:
>
> > We use SLAs as well; they work great for some DAGs and are painful for
> > others.
> >
> > We rely on sensors to validate the data is ready before we run, and each
> > DAG waits on sensors for different times (one DAG waits for 8 hours since
> > it expects data at the start of day but tends to get it 8 hours later). We
> > also have some nested DAGs that are about 10 tasks deep.
> >
> > In these two cases SLA warnings come very late, since the semantics we see
> > are DAG completion time; what we really want is what you were talking
> > about: expected execution times.
> >
> > Also, SLAs trigger on backfills and manual reruns of tasks.
> >
> > I see this as a critical feature for production monitoring, so I would
> > love to see this get improved.
> >
> > On Wed, May 2, 2018, 12:00 PM James Meickle <[email protected]>
> > wrote:
> >
> > > At Quantopian we use Airflow to produce artifacts based on the previous
> > > day's stock market data. These artifacts are required for us to trade on
> > > today's stock market. Therefore, I've been investing time in improving
> > > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > > My attention has turned to Airflow's SLA system, which has some drawbacks
> > > for our use case:
> > >
> > > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > > skipped for this execution date will still trigger emails/callbacks. This
> > > is a huge problem for us because we run almost no tasks on weekends (since
> > > the stock market isn't open).
> > >
> > > 2) Defining SLAs can be awkward because they are relative to the execution
> > > date instead of the task start time. There's no way to alert if a task runs
> > > for "more than an hour", for any non-trivial DAG. Instead you can only
> > > express "more than an hour from execution date". The financial data we use
> > > varies in when it arrives, and how long it takes to process (data volume
> > > changes frequently); we also have tight timelines that make retries
> > > difficult, so we want to alert an operator while leaving the task running,
> > > rather than failing and then alerting.
> > >
> > > 3) SLA miss emails don't have a subject line containing the instance URL
> > > (important for us because we run the same DAGs in both staging/production)
> > > or the execution date they apply to. When opened, they can get hard to read
> > > for even a moderately sized DAG because they include a flat list of task
> > > instances that are unsorted (neither alpha nor topo). They are also lacking
> > > any links back to the Airflow instance.
> > >
> > > 4) SLA emails are not callbacks, and can't be turned off (other than either
> > > removing the SLA or removing the email attribute on the task instance). The
> > > way that SLA miss callbacks are defined is not intuitive, as in contrast to
> > > all other callbacks, they are DAG-level rather than task-level. Also, the
> > > call signature is poorly defined: for instance, two of the arguments are
> > > just strings produced from the other two arguments.
> > >
> > > I have some thoughts about ways to fix these issues:
> > >
> > > 1) I just consider this one a bug. If a task instance is skipped, that was
> > > intentional, and it should not trigger any alerts.
> > >
> > > 2) I think that the `sla=` parameter should be split into something like
> > > this:
> > >
> > > `expected_start`: Timedelta after execution date, representing when this
> > > task must have started by.
> > > `expected_finish`: Timedelta after execution date, representing when this
> > > task must have finished by.
> > > `expected_duration`: Timedelta after task start, representing how long it
> > > is expected to run including all retries.
> > >
> > > This would give better operator control over SLAs, particularly for tasks
> > > deeper in larger DAGs where exact ordering may be hard to predict.
> > >
> > > 3) The emails should be improved to be more operator-friendly, and take
> > > into account that someone may get a callback for a DAG they don't know very
> > > well, or be paged by this notification.
> > >
> > > 4.1) All Airflow callbacks should support a list, rather than requiring a
> > > single function. (I've written a wrapper that does this, but it would be
> > > better for Airflow to just handle this itself.)
> > >
> > > 4.2) SLA miss callbacks should be task callbacks that receive context, like
> > > all the other callbacks. Having a DAG figure out which tasks have missed
> > > SLAs collectively is fine, but getting SLA failures in a batched callback
> > > doesn't really make much sense. Per-task callbacks can be fired
> > > individually within a batch of failures detected at the same time.
> > >
> > > 4.3) SLA emails should be the default SLA miss callback function, rather
> > > than being hardcoded.
> > >
> > > Also, overall, the SLA miss logic is very complicated. It's stuffed into
> > > one overloaded function that is responsible for checking for SLA misses,
> > > creating database objects for them, filtering tasks, selecting emails,
> > > rendering, and sending. Refactoring it would be a good maintainability win.
> > >
> > > I am already implementing some of the above in a private branch, but I'd be
> > > curious to hear community feedback as to which of these suggestions might
> > > be desirable upstream. I could have this ready for Airflow 2.0 if there is
> > > interest beyond my own use case.
> > >
> >
>
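
For anyone skimming the quoted proposal, this is roughly how the three
suggested parameters (`expected_start`, `expected_finish`,
`expected_duration`) could be evaluated for a single task instance, with
skipped tasks never alerting. It is only a sketch of the semantics and not
code from any branch: `TaskTiming` and `sla_violations` are hypothetical
names, and nothing here touches Airflow's real models.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TaskTiming:
    """Hypothetical snapshot of one task instance; not an Airflow class."""
    execution_date: datetime
    start: Optional[datetime] = None   # None means the task has not started
    finish: Optional[datetime] = None  # None means the task has not finished
    skipped: bool = False


def sla_violations(
    task: TaskTiming,
    now: datetime,
    expected_start: Optional[timedelta] = None,
    expected_finish: Optional[timedelta] = None,
    expected_duration: Optional[timedelta] = None,
) -> List[str]:
    """Return the names of the expectations this task has missed as of `now`."""
    if task.skipped:
        return []  # a skip was intentional, so it never counts as a miss

    missed = []
    # Start and finish deadlines are measured from the execution date.
    if expected_start is not None:
        started_at = task.start if task.start is not None else now
        if started_at > task.execution_date + expected_start:
            missed.append("expected_start")
    if expected_finish is not None:
        finished_at = task.finish if task.finish is not None else now
        if finished_at > task.execution_date + expected_finish:
            missed.append("expected_finish")
    # Duration is measured from the task's own start, so it can fire while
    # the task is still running.
    if expected_duration is not None and task.start is not None:
        ended_at = task.finish if task.finish is not None else now
        if ended_at - task.start > expected_duration:
            missed.append("expected_duration")
    return missed
```

Because unfinished tasks are compared against `now`, a long-running task can
be reported while it is still running, which matches the "alert an operator
while leaving the task running" goal in the quoted message.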
