Just giving this a bump; it's a pretty major rework so I'd love to know whether this effort is likely to be accepted if I bring it to a PR-able state, before I invest more time.
On Wed, May 23, 2018 at 1:59 PM, James Meickle <[email protected]> wrote: > Hi folks, > > I've created a branch off of v1-10-test; the diff can be found here: > https://github.com/apache/incubator-airflow/compare/v1- > 10-test...Eronarn:sla_improvements > > As a recap, this work is expected to do the following: > > - split the "sla" parameter into three independent SLAs: expected > duration, expected start, and expected finish > - move the SLA miss callback to be a task-level attribute rather than > DAG-level (removing a lot of the "batching" functionality) > - convert the SLA miss email to the default SLA miss callback > - add a "type" to SLA misses, which will be part of the primary key, and > can be checked against in the callback to respond appropriately to the type > of SLA that was missed. > - don't send SLA misses for skipped tasks, or for backfill jobs > > Before I polish up the remaining TODO functions and write a migration and > tests, I'd appreciate feedback from the maintainers as to whether this > seems to be on the right track, design-wise. (Note that it's definitely not > going to pass tests right now; I am having significant problems getting > Airflow's test suite running locally so I'm not even attempting at the > moment.) > > Thanks, > > -James M. > > On Wed, May 9, 2018 at 12:43 PM, James Meickle <[email protected]> > wrote: > >> Hi all, >> >> Since the response so far has been positive or neutral, I intend to >> submit one or more PRs targeting 2.0 (I think that some parts will be >> separable from a larger SLA refactor). I intend to address at least the >> following JIRA issues: >> >> https://issues.apache.org/jira/browse/AIRFLOW-2236 >> https://issues.apache.org/jira/browse/AIRFLOW-1472 >> https://issues.apache.org/jira/browse/AIRFLOW-1360 >> https://issues.apache.org/jira/browse/AIRFLOW-557 >> https://issues.apache.org/jira/browse/AIRFLOW-133 >> >> Regards, >> >> -James M. >> >> >> >> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin < >> [email protected]> wrote: >> >>> About de-coupling the SLA management process, I've had conversations in >>> the >>> direction of renaming the scheduler to "supervisor" to reflect the fact >>> that it's not just scheduling processes, it does a lot more tasks than >>> just >>> that, SLA management being one of them. >>> >>> I still think the default should be to require a single supervisor that >>> would do all the "supervision" work though. I'm generally against >>> requiring >>> more types of nodes on the cluster. But perhaps the supervisor could have >>> switches to be started in modes where it would only do a subset of its >>> tasks, so that people can run multiple specialized supervisor nodes if >>> they >>> want to. >>> >>> For the record, I was thinking that renaming the scheduler to supervisor >>> would likely happen as we re-write it to enable multiple concurrent >>> supervisor processes. It turns out that parallelizing the scheduler >>> hasn't >>> been as critical as I thought it would be originally, especially with the >>> current multi-process scheduler. Sounds like the community is getting a >>> lot >>> of mileage out of this current multi-process scheduler. >>> >>> Max >>> >>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <[email protected]> >>> wrote: >>> >>> > I would love to see this proposal gets implemented in airflow. >>> > In our case duration based SLA makes much more sense and I ended up >>> adding >>> > a decorator to the execute method in our custom operators. >>> > >>> > Best regards, >>> > Jiening >>> > >>> > -----Original Message----- >>> > From: James Meickle [mailto:[email protected]] >>> > Sent: Wednesday 02 May 2018 9:00 PM >>> > To: [email protected] >>> > Subject: Improving Airflow SLAs [External] >>> > >>> > At Quantopian we use Airflow to produce artifacts based on the previous >>> > day's stock market data. These artifacts are required for us to trade >>> on >>> > today's stock market. Therefore, I've been investing time in improving >>> > Airflow notifications (such as writing PagerDuty and Slack >>> integrations). >>> > My attention has turned to Airflow's SLA system, which has some >>> drawbacks >>> > for our use case: >>> > >>> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is >>> > skipped for this execution date will still trigger emails/callbacks. >>> This >>> > is a huge problem for us because we run almost no tasks on weekends >>> (since >>> > the stock market isn't open). >>> > >>> > 2) Defining SLAs can be awkward because they are relative to the >>> execution >>> > date instead of the task start time. There's no way to alert if a task >>> runs >>> > for "more than an hour", for any non-trivial DAG. Instead you can only >>> > express "more than an hour from execution date". The financial data >>> we use >>> > varies in when it arrives, and how long it takes to process (data >>> volume >>> > changes frequently); we also have tight timelines that make retries >>> > difficult, so we want to alert an operator while leaving the task >>> running, >>> > rather than failing and then alerting. >>> > >>> > 3) SLA miss emails don't have a subject line containing the instance >>> URL >>> > (important for us because we run the same DAGs in both >>> staging/production) >>> > or the execution date they apply to. When opened, they can get hard to >>> read >>> > for even a moderately sized DAG because they include a flat list of >>> task >>> > instances that are unsorted (neither alpha nor topo). They are also >>> lacking >>> > any links back to the Airflow instance. >>> > >>> > 4) SLA emails are not callbacks, and can't be turned off (other than >>> either >>> > removing the SLA or removing the email attribute on the task >>> instance). The >>> > way that SLA miss callbacks are defined is not intuitive, as in >>> contrast to >>> > all other callbacks, they are DAG-level rather than task-level. Also, >>> the >>> > call signature is poorly defined: for instance, two of the arguments >>> are >>> > just strings produced from the other two arguments. >>> > >>> > I have some thoughts about ways to fix these issues: >>> > >>> > 1) I just consider this one a bug. If a task instance is skipped, that >>> was >>> > intentional, and it should not trigger any alerts. >>> > >>> > 2) I think that the `sla=` parameter should be split into something >>> like >>> > this: >>> > >>> > `expected_start`: Timedelta after execution date, representing when >>> this >>> > task must have started by. >>> > `expected_finish`: Timedelta after execution date, representing when >>> this >>> > task must have finished by. >>> > `expected_duration`: Timedelta after task start, representing how long >>> it >>> > is expected to run including all retries. >>> > >>> > This would give better operator control over SLAs, particularly for >>> tasks >>> > deeper in larger DAGs where exact ordering may be hard to predict. >>> > >>> > 3) The emails should be improved to be more operator-friendly, and take >>> > into account that someone may get a callback for a DAG they don't know >>> very >>> > well, or be paged by this notification. >>> > >>> > 4.1) All Airflow callbacks should support a list, rather than >>> requiring a >>> > single function. (I've written a wrapper that does this, but it would >>> be >>> > better for Airflow to just handle this itself.) >>> > >>> > 4.2) SLA miss callbacks should be task callbacks that receive context, >>> like >>> > all the other callbacks. Having a DAG figure out which tasks have >>> missed >>> > SLAs collectively is fine, but getting SLA failures in a batched >>> callback >>> > doesn't really make much sense. Per-task callbacks can be fired >>> > individually within a batch of failures detected at the same time. >>> > >>> > 4.3) SLA emails should be the default SLA miss callback function, >>> rather >>> > than being hardcoded. >>> > >>> > Also, overall, the SLA miss logic is very complicated. It's stuffed >>> into >>> > one overloaded function that is responsible for checking for SLA >>> misses, >>> > creating database objects for them, filtering tasks, selecting emails, >>> > rendering, and sending. Refactoring it would be a good maintainability >>> win. >>> > >>> > I am already implementing some of the above in a private branch, but >>> I'd be >>> > curious to hear community feedback as to which of these suggestions >>> might >>> > be desirable upstream. I could have this ready for Airflow 2.0 if >>> there is >>> > interest beyond my own use case. >>> > >>> >> >> >
