That's a very interesting thought, and I have definitely been bitten multiple times by bugs related to how closely tied SLAs are to the scheduler: https://issues.apache.org/jira/browse/AIRFLOW-2178?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20smtp
However, I'm not convinced that adding a new process for a monitoring service would actually be much better architecturally than just improving the scheduler codebase. To have high confidence you'd likely want an external, non-Airflow "is this running" check anyways (for example, we alert if there are no "heartbeating scheduler" log lines). On Thu, May 3, 2018 at 2:26 AM, Ananth Durai <[email protected]> wrote: > Since we are talking about the SLA implementation, The current SLA miss > implementation is part of the scheduler code. So in the cases like > scheduler max out the process / not running for some reason, we will miss > all the SLA alert. It is worth to decouple SLA alert from the scheduler > path and run as a separate process. > > > Regards, > Ananth.P, > > > > > > > On 2 May 2018 at 20:31, David Capwell <[email protected]> wrote: > > > We use SLA as well and works great for some DAGs and painful for others > > > > We rely on sensors to validate the data is ready before we run and each > dag > > waits on sensors for different times (one dag waits for 8 hours since it > > expects date at the start of day but tends to get it 8 hours later). We > > also have some nested dags that have about 10 tasks deep. > > > > In these two cases SLA warnings come very late since the semantics we see > > is DAG completion time; what we really want is what you were talking > about, > > expected execution times > > > > Also SLA trigger on backfills and manual reruns of tasks > > > > I see this as a critical feature for production monitoring so would love > to > > see this get improved > > > > On Wed, May 2, 2018, 12:00 PM James Meickle <[email protected]> > > wrote: > > > > > At Quantopian we use Airflow to produce artifacts based on the previous > > > day's stock market data. These artifacts are required for us to trade > on > > > today's stock market. Therefore, I've been investing time in improving > > > Airflow notifications (such as writing PagerDuty and Slack > integrations). > > > My attention has turned to Airflow's SLA system, which has some > drawbacks > > > for our use case: > > > > > > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is > > > skipped for this execution date will still trigger emails/callbacks. > This > > > is a huge problem for us because we run almost no tasks on weekends > > (since > > > the stock market isn't open). > > > > > > 2) Defining SLAs can be awkward because they are relative to the > > execution > > > date instead of the task start time. There's no way to alert if a task > > runs > > > for "more than an hour", for any non-trivial DAG. Instead you can only > > > express "more than an hour from execution date". The financial data we > > use > > > varies in when it arrives, and how long it takes to process (data > volume > > > changes frequently); we also have tight timelines that make retries > > > difficult, so we want to alert an operator while leaving the task > > running, > > > rather than failing and then alerting. > > > > > > 3) SLA miss emails don't have a subject line containing the instance > URL > > > (important for us because we run the same DAGs in both > > staging/production) > > > or the execution date they apply to. When opened, they can get hard to > > read > > > for even a moderately sized DAG because they include a flat list of > task > > > instances that are unsorted (neither alpha nor topo). They are also > > lacking > > > any links back to the Airflow instance. > > > > > > 4) SLA emails are not callbacks, and can't be turned off (other than > > either > > > removing the SLA or removing the email attribute on the task instance). > > The > > > way that SLA miss callbacks are defined is not intuitive, as in > contrast > > to > > > all other callbacks, they are DAG-level rather than task-level. Also, > the > > > call signature is poorly defined: for instance, two of the arguments > are > > > just strings produced from the other two arguments. > > > > > > I have some thoughts about ways to fix these issues: > > > > > > 1) I just consider this one a bug. If a task instance is skipped, that > > was > > > intentional, and it should not trigger any alerts. > > > > > > 2) I think that the `sla=` parameter should be split into something > like > > > this: > > > > > > `expected_start`: Timedelta after execution date, representing when > this > > > task must have started by. > > > `expected_finish`: Timedelta after execution date, representing when > this > > > task must have finished by. > > > `expected_duration`: Timedelta after task start, representing how long > it > > > is expected to run including all retries. > > > > > > This would give better operator control over SLAs, particularly for > tasks > > > deeper in larger DAGs where exact ordering may be hard to predict. > > > > > > 3) The emails should be improved to be more operator-friendly, and take > > > into account that someone may get a callback for a DAG they don't know > > very > > > well, or be paged by this notification. > > > > > > 4.1) All Airflow callbacks should support a list, rather than > requiring a > > > single function. (I've written a wrapper that does this, but it would > be > > > better for Airflow to just handle this itself.) > > > > > > 4.2) SLA miss callbacks should be task callbacks that receive context, > > like > > > all the other callbacks. Having a DAG figure out which tasks have > missed > > > SLAs collectively is fine, but getting SLA failures in a batched > callback > > > doesn't really make much sense. Per-task callbacks can be fired > > > individually within a batch of failures detected at the same time. > > > > > > 4.3) SLA emails should be the default SLA miss callback function, > rather > > > than being hardcoded. > > > > > > Also, overall, the SLA miss logic is very complicated. It's stuffed > into > > > one overloaded function that is responsible for checking for SLA > misses, > > > creating database objects for them, filtering tasks, selecting emails, > > > rendering, and sending. Refactoring it would be a good maintainability > > win. > > > > > > I am already implementing some of the above in a private branch, but > I'd > > be > > > curious to hear community feedback as to which of these suggestions > might > > > be desirable upstream. I could have this ready for Airflow 2.0 if there > > is > > > interest beyond my own use case. > > > > > >
