Hi James, I haven't read everything or looked at the entire PR yet, but one thing I wanted to ask about: you state that you're moving the SLA miss callback to the task level. In our org (and I imagine in others), we'd like the callback to stay at the DAG level so we can see whether the entire pipeline is taking longer than X hours, not just whether each individual task is. Is that still feasible?
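
To make the question concrete: my read is that the closest equivalent under your proposal would be to hang the deadline and the callback off a terminal task in the DAG, along the lines of the sketch below. The `expected_finish` and task-level `sla_miss_callback` names are taken from your recap further down, so this is only a sketch of the proposed API, not something that runs on released Airflow:

    import logging
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    log = logging.getLogger(__name__)


    def pipeline_sla_miss(context):
        # Hypothetical task-level callback signature, per the proposal below.
        # Because it is attached to the terminal task, a miss here means the
        # whole pipeline blew through its end-to-end deadline, not just one task.
        log.error("Pipeline for %s exceeded its end-to-end SLA",
                  context["execution_date"])


    with DAG("nightly_pipeline",
             start_date=datetime(2018, 5, 1),
             schedule_interval="@daily") as dag:
        # ...all of the real work happens upstream of this terminal task...
        pipeline_complete = DummyOperator(
            task_id="pipeline_complete",
            # Proposed parameters from the recap below; not in released Airflow:
            expected_finish=timedelta(hours=6),  # entire run must finish within 6h of the execution date
            sla_miss_callback=pipeline_sla_miss,
        )

If that's roughly the intended pattern, it probably covers our case; I mainly want to confirm that an end-to-end check like this stays feasible once the batched DAG-level callback goes away.
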
> On May 24, 2018, at 12:12 PM, James Meickle <[email protected]> wrote:
>
> Just giving this a bump; it's a pretty major rework, so I'd love to know
> whether this effort is likely to be accepted if I bring it to a PR-able
> state, before I invest more time.
>
> On Wed, May 23, 2018 at 1:59 PM, James Meickle <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've created a branch off of v1-10-test; the diff can be found here:
>> https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements
>>
>> As a recap, this work is expected to do the following:
>>
>> - split the "sla" parameter into three independent SLAs: expected
>>   duration, expected start, and expected finish
>> - move the SLA miss callback to be a task-level attribute rather than
>>   DAG-level (removing a lot of the "batching" functionality)
>> - convert the SLA miss email to the default SLA miss callback
>> - add a "type" to SLA misses, which will be part of the primary key and
>>   can be checked against in the callback to respond appropriately to
>>   the type of SLA that was missed
>> - don't send SLA misses for skipped tasks, or for backfill jobs
>>
>> Before I polish up the remaining TODO functions and write a migration
>> and tests, I'd appreciate feedback from the maintainers as to whether
>> this seems to be on the right track, design-wise. (Note that it's
>> definitely not going to pass tests right now; I'm having significant
>> problems getting Airflow's test suite running locally, so I'm not even
>> attempting that at the moment.)
>>
>> Thanks,
>>
>> -James M.
>>
>> On Wed, May 9, 2018 at 12:43 PM, James Meickle <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Since the response so far has been positive or neutral, I intend to
>>> submit one or more PRs targeting 2.0 (I think that some parts will be
>>> separable from a larger SLA refactor). I intend to address at least
>>> the following JIRA issues:
>>>
>>> https://issues.apache.org/jira/browse/AIRFLOW-2236
>>> https://issues.apache.org/jira/browse/AIRFLOW-1472
>>> https://issues.apache.org/jira/browse/AIRFLOW-1360
>>> https://issues.apache.org/jira/browse/AIRFLOW-557
>>> https://issues.apache.org/jira/browse/AIRFLOW-133
>>>
>>> Regards,
>>>
>>> -James M.
>>>
>>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin
>>> <[email protected]> wrote:
>>>
>>>> About de-coupling the SLA management process, I've had conversations
>>>> in the direction of renaming the scheduler to "supervisor" to reflect
>>>> the fact that it's not just scheduling processes; it does a lot more
>>>> than just that, SLA management being one of its tasks.
>>>>
>>>> I still think the default should be to require a single supervisor
>>>> that would do all the "supervision" work, though. I'm generally
>>>> against requiring more types of nodes on the cluster. But perhaps the
>>>> supervisor could have switches to start it in modes where it would
>>>> only do a subset of its tasks, so that people can run multiple
>>>> specialized supervisor nodes if they want to.
>>>>
>>>> For the record, I was thinking that renaming the scheduler to
>>>> supervisor would likely happen as we re-write it to enable multiple
>>>> concurrent supervisor processes. It turns out that parallelizing the
>>>> scheduler hasn't been as critical as I originally thought it would
>>>> be, especially with the current multi-process scheduler. Sounds like
>>>> the community is getting a lot of mileage out of this current
>>>> multi-process scheduler.
>>>>
>>>> Max
>>>>
>>>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <[email protected]>
>>>> wrote:
>>>>
>>>>> I would love to see this proposal get implemented in Airflow.
>>>>> In our case, duration-based SLAs make much more sense, and I ended
>>>>> up adding a decorator to the execute method in our custom operators.
>>>>>
>>>>> Best regards,
>>>>> Jiening
>>>>>
>>>>> -----Original Message-----
>>>>> From: James Meickle [mailto:[email protected]]
>>>>> Sent: Wednesday 02 May 2018 9:00 PM
>>>>> To: [email protected]
>>>>> Subject: Improving Airflow SLAs [External]
>>>>>
>>>>> At Quantopian we use Airflow to produce artifacts based on the
>>>>> previous day's stock market data. These artifacts are required for
>>>>> us to trade on today's stock market. Therefore, I've been investing
>>>>> time in improving Airflow notifications (such as writing PagerDuty
>>>>> and Slack integrations). My attention has turned to Airflow's SLA
>>>>> system, which has some drawbacks for our use case:
>>>>>
>>>>> 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
>>>>> skipped for this execution date will still trigger emails/callbacks.
>>>>> This is a huge problem for us because we run almost no tasks on
>>>>> weekends (since the stock market isn't open).
>>>>>
>>>>> 2) Defining SLAs can be awkward because they are relative to the
>>>>> execution date instead of the task start time. There's no way to
>>>>> alert if a task runs for "more than an hour" in any non-trivial DAG;
>>>>> instead you can only express "more than an hour from the execution
>>>>> date". The financial data we use varies in when it arrives and how
>>>>> long it takes to process (data volume changes frequently); we also
>>>>> have tight timelines that make retries difficult, so we want to
>>>>> alert an operator while leaving the task running, rather than
>>>>> failing and then alerting.
>>>>>
>>>>> 3) SLA miss emails don't have a subject line containing the instance
>>>>> URL (important for us because we run the same DAGs in both staging
>>>>> and production) or the execution date they apply to. When opened,
>>>>> they can be hard to read for even a moderately sized DAG because
>>>>> they include a flat list of task instances that is unsorted (neither
>>>>> alphabetically nor topologically). They are also lacking any links
>>>>> back to the Airflow instance.
>>>>>
>>>>> 4) SLA emails are not callbacks and can't be turned off (other than
>>>>> by removing either the SLA or the email attribute on the task
>>>>> instance). The way that SLA miss callbacks are defined is not
>>>>> intuitive: in contrast to all other callbacks, they are DAG-level
>>>>> rather than task-level. Also, the call signature is poorly defined;
>>>>> for instance, two of the arguments are just strings produced from
>>>>> the other two arguments.
>>>>>
>>>>> I have some thoughts about ways to fix these issues:
>>>>>
>>>>> 1) I just consider this one a bug. If a task instance is skipped,
>>>>> that was intentional, and it should not trigger any alerts.
>>>>>
>>>>> 2) I think that the `sla=` parameter should be split into something
>>>>> like this:
>>>>>
>>>>> `expected_start`: Timedelta after the execution date, representing
>>>>> when this task must have started by.
>>>>> `expected_finish`: Timedelta after the execution date, representing
>>>>> when this task must have finished by.
>>>>> `expected_duration`: Timedelta after the task start, representing
>>>>> how long it is expected to run, including all retries.
>>>>>
>>>>> This would give operators better control over SLAs, particularly for
>>>>> tasks deeper in larger DAGs where exact ordering may be hard to
>>>>> predict.
>>>>>
>>>>> 3) The emails should be improved to be more operator-friendly, and
>>>>> take into account that someone may get a callback for a DAG they
>>>>> don't know very well, or be paged by this notification.
>>>>>
>>>>> 4.1) All Airflow callbacks should support a list, rather than
>>>>> requiring a single function. (I've written a wrapper that does this,
>>>>> but it would be better for Airflow to just handle it itself.)
>>>>>
>>>>> 4.2) SLA miss callbacks should be task callbacks that receive
>>>>> context, like all the other callbacks. Having the DAG figure out
>>>>> which tasks have missed SLAs collectively is fine, but getting SLA
>>>>> failures in a batched callback doesn't make much sense. Per-task
>>>>> callbacks can be fired individually within a batch of failures
>>>>> detected at the same time.
>>>>>
>>>>> 4.3) SLA emails should be the default SLA miss callback function,
>>>>> rather than being hardcoded.
>>>>>
>>>>> Also, overall, the SLA miss logic is very complicated. It's stuffed
>>>>> into one overloaded function that is responsible for checking for
>>>>> SLA misses, creating database objects for them, filtering tasks,
>>>>> selecting emails, rendering, and sending. Refactoring it would be a
>>>>> good maintainability win.
>>>>>
>>>>> I am already implementing some of the above in a private branch, but
>>>>> I'd be curious to hear community feedback as to which of these
>>>>> suggestions might be desirable upstream. I could have this ready for
>>>>> Airflow 2.0 if there is interest beyond my own use case.
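
For anyone skimming the (long) quote above: this is how I'm reading the proposed split of the single `sla=` parameter. The parameter names come from James's recap and the exact semantics are my own guess, so treat it as a sketch rather than working code:

    from datetime import timedelta

    from airflow.operators.bash_operator import BashOperator

    process_market_data = BashOperator(
        task_id="process_market_data",
        bash_command="python process_market_data.py",  # hypothetical command
        # Proposed split of the single `sla=` parameter, per the recap above;
        # none of these exist in released Airflow yet:
        expected_start=timedelta(hours=2),     # must have started within 2h of the execution date
        expected_finish=timedelta(hours=5),    # must have finished within 5h of the execution date
        expected_duration=timedelta(hours=1),  # once started, should not run longer than 1h, retries included
        dag=dag,  # assumes an existing `dag` object
    )

The duration-based variant is the one Jiening describes wanting above, and the finish-based variant is what I'd lean on for the pipeline-level question in my reply at the top.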
