Hi James, I haven't read everything or looked at the entire PR yet, but one thing I wanted to ask about: you state that you're moving the SLA miss callback to the task level. In our org (and I imagine in others), we'd like the callback to stay at the DAG level so we can see whether the entire pipeline is taking longer than X hours, not just whether each individual task is. Is that still feasible?
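
To make the question concrete: my read is that the closest equivalent under your proposal would be to hang the deadline and the callback off a terminal task in the DAG, along the lines of the sketch below. The `expected_finish` and task-level `sla_miss_callback` names are taken from your recap further down, so this is only a sketch of the proposed API, not something that runs on released Airflow:

    import logging
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    log = logging.getLogger(__name__)


    def pipeline_sla_miss(context):
        # Hypothetical task-level callback signature, per the proposal below.
        # Because it is attached to the terminal task, a miss here means the
        # whole pipeline blew through its end-to-end deadline, not just one task.
        log.error("Pipeline for %s exceeded its end-to-end SLA",
                  context["execution_date"])


    with DAG("nightly_pipeline",
             start_date=datetime(2018, 5, 1),
             schedule_interval="@daily") as dag:
        # ...all of the real work happens upstream of this terminal task...
        pipeline_complete = DummyOperator(
            task_id="pipeline_complete",
            # Proposed parameters from the recap below; not in released Airflow:
            expected_finish=timedelta(hours=6),  # entire run must finish within 6h of the execution date
            sla_miss_callback=pipeline_sla_miss,
        )

If that's roughly the intended pattern, it probably covers our case; I mainly want to confirm that an end-to-end check like this stays feasible once the batched DAG-level callback goes away.
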
> On May 24, 2018, at 12:12 PM, James Meickle <[email protected]> wrote:
>
> Just giving this a bump; it's a pretty major rework, so I'd love to know
> whether this effort is likely to be accepted if I bring it to a PR-able
> state, before I invest more time.
>
> On Wed, May 23, 2018 at 1:59 PM, James Meickle <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've created a branch off of v1-10-test; the diff can be found here:
>> https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements
>>
>> As a recap, this work is expected to do the following:
>>
>> - split the "sla" parameter into three independent SLAs: expected
>>   duration, expected start, and expected finish
>> - move the SLA miss callback to be a task-level attribute rather than
>>   DAG-level (removing a lot of the "batching" functionality)
>> - convert the SLA miss email to the default SLA miss callback
>> - add a "type" to SLA misses, which will be part of the primary key and
>>   can be checked against in the callback to respond appropriately to
>>   the type of SLA that was missed
>> - don't send SLA misses for skipped tasks, or for backfill jobs
>>
>> Before I polish up the remaining TODO functions and write a migration
>> and tests, I'd appreciate feedback from the maintainers as to whether
>> this seems to be on the right track, design-wise. (Note that it's
>> definitely not going to pass tests right now; I'm having significant
>> problems getting Airflow's test suite running locally, so I'm not even
>> attempting that at the moment.)
>>
>> Thanks,
>>
>> -James M.
>>
>> On Wed, May 9, 2018 at 12:43 PM, James Meickle <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Since the response so far has been positive or neutral, I intend to
>>> submit one or more PRs targeting 2.0 (I think that some parts will be
>>> separable from a larger SLA refactor). I intend to address at least
>>> the following JIRA issues:
>>>
>>> https://issues.apache.org/jira/browse/AIRFLOW-2236
>>> https://issues.apache.org/jira/browse/AIRFLOW-1472
>>> https://issues.apache.org/jira/browse/AIRFLOW-1360
>>> https://issues.apache.org/jira/browse/AIRFLOW-557
>>> https://issues.apache.org/jira/browse/AIRFLOW-133
>>>
>>> Regards,
>>>
>>> -James M.
>>>
>>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin
>>> <[email protected]> wrote:
>>>
>>>> About de-coupling the SLA management process, I've had conversations
>>>> in the direction of renaming the scheduler to "supervisor" to reflect
>>>> the fact that it's not just scheduling processes; it does a lot more
>>>> than just that, SLA management being one of its tasks.
>>>>
>>>> I still think the default should be to require a single supervisor
>>>> that would do all the "supervision" work, though. I'm generally
>>>> against requiring more types of nodes on the cluster. But perhaps the
>>>> supervisor could have switches to start it in modes where it would
>>>> only do a subset of its tasks, so that people can run multiple
>>>> specialized supervisor nodes if they want to.
>>>>
>>>> For the record, I was thinking that renaming the scheduler to
>>>> supervisor would likely happen as we re-write it to enable multiple
>>>> concurrent supervisor processes. It turns out that parallelizing the
>>>> scheduler hasn't been as critical as I originally thought it would
>>>> be, especially with the current multi-process scheduler. Sounds like
>>>> the community is getting a lot of mileage out of this current
>>>> multi-process scheduler.
>>>>
>>>> Max
>>>>
>>>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <[email protected]>
>>>> wrote:
>>>>
>>>>> I would love to see this proposal get implemented in Airflow.
>>>>> In our case, duration-based SLAs make much more sense, and I ended
>>>>> up adding a decorator to the execute method in our custom operators.
>>>>>
>>>>> Best regards,
>>>>> Jiening
>>>>>
>>>>> -----Original Message-----
>>>>> From: James Meickle [mailto:[email protected]]
>>>>> Sent: Wednesday 02 May 2018 9:00 PM
>>>>> To: [email protected]
>>>>> Subject: Improving Airflow SLAs [External]
>>>>>
>>>>> At Quantopian we use Airflow to produce artifacts based on the
>>>>> previous day's stock market data. These artifacts are required for
>>>>> us to trade on today's stock market. Therefore, I've been investing
>>>>> time in improving Airflow notifications (such as writing PagerDuty
>>>>> and Slack integrations). My attention has turned to Airflow's SLA
>>>>> system, which has some drawbacks for our use case:
>>>>>
>>>>> 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
>>>>> skipped for this execution date will still trigger emails/callbacks.
>>>>> This is a huge problem for us because we run almost no tasks on
>>>>> weekends (since the stock market isn't open).
>>>>>
>>>>> 2) Defining SLAs can be awkward because they are relative to the
>>>>> execution date instead of the task start time. There's no way to
>>>>> alert if a task runs for "more than an hour" in any non-trivial DAG;
>>>>> instead you can only express "more than an hour from the execution
>>>>> date". The financial data we use varies in when it arrives and how
>>>>> long it takes to process (data volume changes frequently); we also
>>>>> have tight timelines that make retries difficult, so we want to
>>>>> alert an operator while leaving the task running, rather than
>>>>> failing and then alerting.
>>>>>
>>>>> 3) SLA miss emails don't have a subject line containing the instance
>>>>> URL (important for us because we run the same DAGs in both staging
>>>>> and production) or the execution date they apply to. When opened,
>>>>> they can be hard to read for even a moderately sized DAG because
>>>>> they include a flat list of task instances that is unsorted (neither
>>>>> alphabetically nor topologically). They are also lacking any links
>>>>> back to the Airflow instance.
>>>>>
>>>>> 4) SLA emails are not callbacks and can't be turned off (other than
>>>>> by removing either the SLA or the email attribute on the task
>>>>> instance). The way that SLA miss callbacks are defined is not
>>>>> intuitive: in contrast to all other callbacks, they are DAG-level
>>>>> rather than task-level. Also, the call signature is poorly defined;
>>>>> for instance, two of the arguments are just strings produced from
>>>>> the other two arguments.
>>>>>
>>>>> I have some thoughts about ways to fix these issues:
>>>>>
>>>>> 1) I just consider this one a bug. If a task instance is skipped,
>>>>> that was intentional, and it should not trigger any alerts.
>>>>>
>>>>> 2) I think that the `sla=` parameter should be split into something
>>>>> like this:
>>>>>
>>>>> `expected_start`: Timedelta after the execution date, representing
>>>>> when this task must have started by.
>>>>> `expected_finish`: Timedelta after the execution date, representing
>>>>> when this task must have finished by.
>>>>> `expected_duration`: Timedelta after the task start, representing
>>>>> how long it is expected to run, including all retries.
>>>>>
>>>>> This would give operators better control over SLAs, particularly for
>>>>> tasks deeper in larger DAGs where exact ordering may be hard to
>>>>> predict.
>>>>>
>>>>> 3) The emails should be improved to be more operator-friendly, and
>>>>> take into account that someone may get a callback for a DAG they
>>>>> don't know very well, or be paged by this notification.
>>>>>
>>>>> 4.1) All Airflow callbacks should support a list, rather than
>>>>> requiring a single function. (I've written a wrapper that does this,
>>>>> but it would be better for Airflow to just handle it itself.)
>>>>>
>>>>> 4.2) SLA miss callbacks should be task callbacks that receive
>>>>> context, like all the other callbacks. Having the DAG figure out
>>>>> which tasks have missed SLAs collectively is fine, but getting SLA
>>>>> failures in a batched callback doesn't make much sense. Per-task
>>>>> callbacks can be fired individually within a batch of failures
>>>>> detected at the same time.
>>>>>
>>>>> 4.3) SLA emails should be the default SLA miss callback function,
>>>>> rather than being hardcoded.
>>>>>
>>>>> Also, overall, the SLA miss logic is very complicated. It's stuffed
>>>>> into one overloaded function that is responsible for checking for
>>>>> SLA misses, creating database objects for them, filtering tasks,
>>>>> selecting emails, rendering, and sending. Refactoring it would be a
>>>>> good maintainability win.
>>>>>
>>>>> I am already implementing some of the above in a private branch, but
>>>>> I'd be curious to hear community feedback as to which of these
>>>>> suggestions might be desirable upstream. I could have this ready for
>>>>> Airflow 2.0 if there is interest beyond my own use case.
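
For anyone skimming the (long) quote above: this is how I'm reading the proposed split of the single `sla=` parameter. The parameter names come from James's recap and the exact semantics are my own guess, so treat it as a sketch rather than working code:

    from datetime import timedelta

    from airflow.operators.bash_operator import BashOperator

    process_market_data = BashOperator(
        task_id="process_market_data",
        bash_command="python process_market_data.py",  # hypothetical command
        # Proposed split of the single `sla=` parameter, per the recap above;
        # none of these exist in released Airflow yet:
        expected_start=timedelta(hours=2),     # must have started within 2h of the execution date
        expected_finish=timedelta(hours=5),    # must have finished within 5h of the execution date
        expected_duration=timedelta(hours=1),  # once started, should not run longer than 1h, retries included
        dag=dag,  # assumes an existing `dag` object
    )

The duration-based variant is the one Jiening describes wanting above, and the finish-based variant is what I'd lean on for the pipeline-level question in my reply at the top.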
