Re: Improving Airflow SLAs

James Meickle Thu, 24 May 2018 12:14:03 -0700

Just giving this a bump; it's a pretty major rework so I'd love to know
whether this effort is likely to be accepted if I bring it to a PR-able
state, before I invest more time.


On Wed, May 23, 2018 at 1:59 PM, James Meickle <[email protected]>
wrote:

> Hi folks,
>
> I've created a branch off of v1-10-test; the diff can be found here:
> https://github.com/apache/incubator-airflow/compare/v1-
> 10-test...Eronarn:sla_improvements
>
> As a recap, this work is expected to do the following:
>
> - split the "sla" parameter into three independent SLAs: expected
> duration, expected start, and expected finish
> - move the SLA miss callback to be a task-level attribute rather than
> DAG-level (removing a lot of the "batching" functionality)
> - convert the SLA miss email to the default SLA miss callback
> - add a "type" to SLA misses, which will be part of the primary key, and
> can be checked against in the callback to respond appropriately to the type
> of SLA that was missed.
> - don't send SLA misses for skipped tasks, or for backfill jobs
>
> Before I polish up the remaining TODO functions and write a migration and
> tests, I'd appreciate feedback from the maintainers as to whether this
> seems to be on the right track, design-wise. (Note that it's definitely not
> going to pass tests right now; I am having significant problems getting
> Airflow's test suite running locally so I'm not even attempting at the
> moment.)
>
> Thanks,
>
> -James M.
>
> On Wed, May 9, 2018 at 12:43 PM, James Meickle <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Since the response so far has been positive or neutral, I intend to
>> submit one or more PRs targeting 2.0 (I think that some parts will be
>> separable from a larger SLA refactor). I intend to address at least the
>> following JIRA issues:
>>
>> https://issues.apache.org/jira/browse/AIRFLOW-2236
>> https://issues.apache.org/jira/browse/AIRFLOW-1472
>> https://issues.apache.org/jira/browse/AIRFLOW-1360
>> https://issues.apache.org/jira/browse/AIRFLOW-557
>> https://issues.apache.org/jira/browse/AIRFLOW-133
>>
>> Regards,
>>
>> -James M.
>>
>>
>>
>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
>> [email protected]> wrote:
>>
>>> About de-coupling the SLA management process, I've had conversations in
>>> the
>>> direction of renaming the scheduler to "supervisor" to reflect the fact
>>> that it's not just scheduling processes, it does a lot more tasks than
>>> just
>>> that, SLA management being one of them.
>>>
>>> I still think the default should be to require a single supervisor that
>>> would do all the "supervision" work though. I'm generally against
>>> requiring
>>> more types of nodes on the cluster. But perhaps the supervisor could have
>>> switches to be started in modes where it would only do a subset of its
>>> tasks, so that people can run multiple specialized supervisor nodes if
>>> they
>>> want to.
>>>
>>> For the record, I was thinking that renaming the scheduler to supervisor
>>> would likely happen as we re-write it to enable multiple concurrent
>>> supervisor processes. It turns out that parallelizing the scheduler
>>> hasn't
>>> been as critical as I thought it would be originally, especially with the
>>> current multi-process scheduler. Sounds like the community is getting a
>>> lot
>>> of mileage out of this current multi-process scheduler.
>>>
>>> Max
>>>
>>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen <[email protected]>
>>> wrote:
>>>
>>> > I would love to see this proposal gets implemented in airflow.
>>> > In our case duration based SLA makes much more sense and I ended up
>>> adding
>>> > a decorator to the execute method in our custom operators.
>>> >
>>> > Best regards,
>>> > Jiening
>>> >
>>> > -----Original Message-----
>>> > From: James Meickle [mailto:[email protected]]
>>> > Sent: Wednesday 02 May 2018 9:00 PM
>>> > To: [email protected]
>>> > Subject: Improving Airflow SLAs [External]
>>> >
>>> > At Quantopian we use Airflow to produce artifacts based on the previous
>>> > day's stock market data. These artifacts are required for us to trade
>>> on
>>> > today's stock market. Therefore, I've been investing time in improving
>>> > Airflow notifications (such as writing PagerDuty and Slack
>>> integrations).
>>> > My attention has turned to Airflow's SLA system, which has some
>>> drawbacks
>>> > for our use case:
>>> >
>>> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
>>> > skipped for this execution date will still trigger emails/callbacks.
>>> This
>>> > is a huge problem for us because we run almost no tasks on weekends
>>> (since
>>> > the stock market isn't open).
>>> >
>>> > 2) Defining SLAs can be awkward because they are relative to the
>>> execution
>>> > date instead of the task start time. There's no way to alert if a task
>>> runs
>>> > for "more than an hour", for any non-trivial DAG. Instead you can only
>>> > express "more than an hour from execution date".  The financial data
>>> we use
>>> > varies in when it arrives, and how long it takes to process (data
>>> volume
>>> > changes frequently); we also have tight timelines that make retries
>>> > difficult, so we want to alert an operator while leaving the task
>>> running,
>>> > rather than failing and then alerting.
>>> >
>>> > 3) SLA miss emails don't have a subject line containing the instance
>>> URL
>>> > (important for us because we run the same DAGs in both
>>> staging/production)
>>> > or the execution date they apply to. When opened, they can get hard to
>>> read
>>> > for even a moderately sized DAG because they include a flat list of
>>> task
>>> > instances that are unsorted (neither alpha nor topo). They are also
>>> lacking
>>> > any links back to the Airflow instance.
>>> >
>>> > 4) SLA emails are not callbacks, and can't be turned off (other than
>>> either
>>> > removing the SLA or removing the email attribute on the task
>>> instance). The
>>> > way that SLA miss callbacks are defined is not intuitive, as in
>>> contrast to
>>> > all other callbacks, they are DAG-level rather than task-level. Also,
>>> the
>>> > call signature is poorly defined: for instance, two of the arguments
>>> are
>>> > just strings produced from the other two arguments.
>>> >
>>> > I have some thoughts about ways to fix these issues:
>>> >
>>> > 1) I just consider this one a bug. If a task instance is skipped, that
>>> was
>>> > intentional, and it should not trigger any alerts.
>>> >
>>> > 2) I think that the `sla=` parameter should be split into something
>>> like
>>> > this:
>>> >
>>> > `expected_start`: Timedelta after execution date, representing when
>>> this
>>> > task must have started by.
>>> > `expected_finish`: Timedelta after execution date, representing when
>>> this
>>> > task must have finished by.
>>> > `expected_duration`: Timedelta after task start, representing how long
>>> it
>>> > is expected to run including all retries.
>>> >
>>> > This would give better operator control over SLAs, particularly for
>>> tasks
>>> > deeper in larger DAGs where exact ordering may be hard to predict.
>>> >
>>> > 3) The emails should be improved to be more operator-friendly, and take
>>> > into account that someone may get a callback for a DAG they don't know
>>> very
>>> > well, or be paged by this notification.
>>> >
>>> > 4.1) All Airflow callbacks should support a list, rather than
>>> requiring a
>>> > single function. (I've written a wrapper that does this, but it would
>>> be
>>> > better for Airflow to just handle this itself.)
>>> >
>>> > 4.2) SLA miss callbacks should be task callbacks that receive context,
>>> like
>>> > all the other callbacks. Having a DAG figure out which tasks have
>>> missed
>>> > SLAs collectively is fine, but getting SLA failures in a batched
>>> callback
>>> > doesn't really make much sense. Per-task callbacks can be fired
>>> > individually within a batch of failures detected at the same time.
>>> >
>>> > 4.3) SLA emails should be the default SLA miss callback function,
>>> rather
>>> > than being hardcoded.
>>> >
>>> > Also, overall, the SLA miss logic is very complicated. It's stuffed
>>> into
>>> > one overloaded function that is responsible for checking for SLA
>>> misses,
>>> > creating database objects for them, filtering tasks, selecting emails,
>>> > rendering, and sending. Refactoring it would be a good maintainability
>>> win.
>>> >
>>> > I am already implementing some of the above in a private branch, but
>>> I'd be
>>> > curious to hear community feedback as to which of these suggestions
>>> might
>>> > be desirable upstream. I could have this ready for Airflow 2.0 if
>>> there is
>>> > interest beyond my own use case.
>>> >
>>>
>>
>>
>

Re: Improving Airflow SLAs

Reply via email to