Hi all, Since the response so far has been positive or neutral, I intend to submit one or more PRs targeting 2.0 (I think that some parts will be separable from a larger SLA refactor). I intend to address at least the following JIRA issues:
https://issues.apache.org/jira/browse/AIRFLOW-2236 https://issues.apache.org/jira/browse/AIRFLOW-1472 https://issues.apache.org/jira/browse/AIRFLOW-1360 https://issues.apache.org/jira/browse/AIRFLOW-557 https://issues.apache.org/jira/browse/AIRFLOW-133 Regards, -James M. On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin < [email protected]> wrote: > About de-coupling the SLA management process, I've had conversations in the > direction of renaming the scheduler to "supervisor" to reflect the fact > that it's not just scheduling processes, it does a lot more tasks than just > that, SLA management being one of them. > > I still think the default should be to require a single supervisor that > would do all the "supervision" work though. I'm generally against requiring > more types of nodes on the cluster. But perhaps the supervisor could have > switches to be started in modes where it would only do a subset of its > tasks, so that people can run multiple specialized supervisor nodes if they > want to. > > For the record, I was thinking that renaming the scheduler to supervisor > would likely happen as we re-write it to enable multiple concurrent > supervisor processes. It turns out that parallelizing the scheduler hasn't > been as critical as I thought it would be originally, especially with the > current multi-process scheduler. Sounds like the community is getting a lot > of mileage out of this current multi-process scheduler. > > Max > > On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <[email protected]> > wrote: > > > I would love to see this proposal gets implemented in airflow. > > In our case duration based SLA makes much more sense and I ended up > adding > > a decorator to the execute method in our custom operators. > > > > Best regards, > > Jiening > > > > -----Original Message----- > > From: James Meickle [mailto:[email protected]] > > Sent: Wednesday 02 May 2018 9:00 PM > > To: [email protected] > > Subject: Improving Airflow SLAs [External] > > > > At Quantopian we use Airflow to produce artifacts based on the previous > > day's stock market data. These artifacts are required for us to trade on > > today's stock market. Therefore, I've been investing time in improving > > Airflow notifications (such as writing PagerDuty and Slack integrations). > > My attention has turned to Airflow's SLA system, which has some drawbacks > > for our use case: > > > > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is > > skipped for this execution date will still trigger emails/callbacks. This > > is a huge problem for us because we run almost no tasks on weekends > (since > > the stock market isn't open). > > > > 2) Defining SLAs can be awkward because they are relative to the > execution > > date instead of the task start time. There's no way to alert if a task > runs > > for "more than an hour", for any non-trivial DAG. Instead you can only > > express "more than an hour from execution date". The financial data we > use > > varies in when it arrives, and how long it takes to process (data volume > > changes frequently); we also have tight timelines that make retries > > difficult, so we want to alert an operator while leaving the task > running, > > rather than failing and then alerting. > > > > 3) SLA miss emails don't have a subject line containing the instance URL > > (important for us because we run the same DAGs in both > staging/production) > > or the execution date they apply to. When opened, they can get hard to > read > > for even a moderately sized DAG because they include a flat list of task > > instances that are unsorted (neither alpha nor topo). They are also > lacking > > any links back to the Airflow instance. > > > > 4) SLA emails are not callbacks, and can't be turned off (other than > either > > removing the SLA or removing the email attribute on the task instance). > The > > way that SLA miss callbacks are defined is not intuitive, as in contrast > to > > all other callbacks, they are DAG-level rather than task-level. Also, the > > call signature is poorly defined: for instance, two of the arguments are > > just strings produced from the other two arguments. > > > > I have some thoughts about ways to fix these issues: > > > > 1) I just consider this one a bug. If a task instance is skipped, that > was > > intentional, and it should not trigger any alerts. > > > > 2) I think that the `sla=` parameter should be split into something like > > this: > > > > `expected_start`: Timedelta after execution date, representing when this > > task must have started by. > > `expected_finish`: Timedelta after execution date, representing when this > > task must have finished by. > > `expected_duration`: Timedelta after task start, representing how long it > > is expected to run including all retries. > > > > This would give better operator control over SLAs, particularly for tasks > > deeper in larger DAGs where exact ordering may be hard to predict. > > > > 3) The emails should be improved to be more operator-friendly, and take > > into account that someone may get a callback for a DAG they don't know > very > > well, or be paged by this notification. > > > > 4.1) All Airflow callbacks should support a list, rather than requiring a > > single function. (I've written a wrapper that does this, but it would be > > better for Airflow to just handle this itself.) > > > > 4.2) SLA miss callbacks should be task callbacks that receive context, > like > > all the other callbacks. Having a DAG figure out which tasks have missed > > SLAs collectively is fine, but getting SLA failures in a batched callback > > doesn't really make much sense. Per-task callbacks can be fired > > individually within a batch of failures detected at the same time. > > > > 4.3) SLA emails should be the default SLA miss callback function, rather > > than being hardcoded. > > > > Also, overall, the SLA miss logic is very complicated. It's stuffed into > > one overloaded function that is responsible for checking for SLA misses, > > creating database objects for them, filtering tasks, selecting emails, > > rendering, and sending. Refactoring it would be a good maintainability > win. > > > > I am already implementing some of the above in a private branch, but I'd > be > > curious to hear community feedback as to which of these suggestions might > > be desirable upstream. I could have this ready for Airflow 2.0 if there > is > > interest beyond my own use case. > > >
