[
https://issues.apache.org/jira/browse/AIRFLOW-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927793#comment-16927793
]
ASF GitHub Bot commented on AIRFLOW-249:
----------------------------------------
Eronarn commented on pull request #3584: [AIRFLOW-249] Refactor the SLA
mechanism
URL: https://github.com/apache/airflow/pull/3584
### JIRA
AIRFLOW-249 Refactor the SLA mechanism
This is a fairly large patch that should and/or may also address:
AIRFLOW-1360 SLA miss goes to all emails in DAG
AIRFLOW-2236 Airflow SLA is triggered for all backfilled tasks
AIRFLOW-557 SLA notification does not work for manually triggered DAGs
AIRFLOW-133 SLAs don't seem to work with schedule_interval=None
AIRFLOW-1013 airflow/jobs.py:manage_slas() exception for @once dag
### Description
At Quantopian we use Airflow to produce artifacts based on the previous
day's stock market data. These artifacts are required for us to trade on
today's stock market. Therefore, I've been investing time in improving Airflow
notifications (such as writing PagerDuty and Slack integrations). My attention
has turned to Airflow's SLA system, which has some drawbacks for our use case:
1) Defining SLAs can be awkward because they are relative to the execution
date instead of the task start time. There's no way to alert if a task runs for
"more than an hour", for any non-trivial DAG. Instead you can only express
"more than an hour from execution date". The financial data we use varies in
when it arrives, and how long it takes to process (data volume changes
frequently). We also run DAGs with mixed UTC and Eastern events, making Airflow
SLA definitions depend on time of year.
2) Execution timeouts can capture "more than an hour" but do not serve the
same purpose as SLAs. We have tight timelines on long-duration tasks that make
retries difficult, so we want to alert an operator while leaving the original
task running, rather than failing and then alerting.
3) The way that SLA miss callbacks are defined is not intuitive, as in
contrast to other callbacks, they are defined on the DAG rather than on the
task. This has lead to a lot of confusion in JIRA/the mailing list; many people
think that SLAs are for DAG completion, when in reality it's a DAG-level
attribute that handles batched task-level completions. Also, the call signature
is poorly defined: for instance, two of the arguments are just strings produced
from the other two arguments.
4) SLA miss emails don't include any links back to the Airflow instance
(important for us because we run the same DAGs in both staging/production) or
the execution date they apply to. When opened, they be hard to read for even a
moderately sized DAG because they include a flat list of task instances that
are unsorted (neither alpha nor topo).
5) SLA miss emails are sent to every email address associated with the DAG.
This can lead to inadvertent paging of users associated with unrelated "forks"
in the DAG from where the SLA miss failed.
6) SLA emails are not callbacks, and can't be turned off (other than either
removing the SLA or removing the email attribute on the task instance).
This patch attempts to address the above issues by making some of the
following changes:
1) The `sla=` parameter is split into:
`expected_start`: Timedelta after execution date, representing when this
task must have started by.
`expected_finish`: Timedelta after execution date, representing when this
task must have finished by.
`expected_duration`: Timedelta after task start, representing how long this
task is expected to run, including all retries.
These parameters are set on a task (or DAG-level default args), and a task
can have any combination of them, though there is some basic validation logic
to warn you if you try to set an illogical combination. The SlaMiss object
stores the type of SLA miss as a new database field, which is a component of
the primary key.
There is logic to convert the existing `sla=` parameter to `expected_finish`
(as well as a migration), since that's the closest parallel, so it should be
relatively backwards compatible.
2) SLA misses are no longer grouped for checks or callbacks. While there
have always been independent per-task SlaMiss objects, there was a lot of logic
to poll for all SlaMisses that occurred at the same time, and to batch them
into a single email.
3) As a consequence of 2), The `sla_miss_callback` is no longer set on the
DAG level, which has been confusing. It now has a context-based signature to be
consistent with other task callbacks. This change is not backwards compatible
for anyone using custom SLA miss callbacks, but should be a fairly
straightforward conversion.
4) The SLA miss email is now the default SLA miss callback on tasks.
Previously it was an additional non-overrideable feature.
5) The SLA miss email has some improvements:
- Only one SLA miss per email
- SLA-miss-specific title
- Includes a link to the task instance
- Only includes potentially-blocked downstreams
- Sends to a list of "interested subscribers", which is defined as all email
addresses on tasks downstream of the task that missed its SLA.
- Additional ASCII art to help distinguish emails.
6) Move the SLA miss code largely out of the Scheduler code: some into
models (DAGs manage their own SlaMiss objects), and some into SLA helper
functions.
7) Overall, attempt to reduce the complexity and lack of documentation of
the SLA miss logic, given the constraint that the new implementation is a
larger feature and more lines of code. The previous implementation was stuffed
into one overloaded function that is responsible for checking for SLA misses,
creating database objects for them, filtering tasks, selecting emails,
rendering, and sending. These are now broken into multiple functions, which
attempt to be more single-purpose.
### Tests
I have rewritten the existing tests to match this patch; I am open to adding
any required additional tests. (The tests probably also want to be moved out of
the scheduler suite.)
### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I
have squashed multiple commits if they address the same issue. In addition, my
commits follow the guidelines from "[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)"
### Documentation
- [x] In case of new functionality, my PR adds documentation that describes
how to use it.
(Sort of - I still need to rewrite the high-level, on-website SLA
documentation if this gets approved.)
### Code Quality
- [x] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
(except for a part I did not touch, and the core alembic migration template
generating flake8 errors...)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Refactor the SLA mechanism
> --------------------------
>
> Key: AIRFLOW-249
> URL: https://issues.apache.org/jira/browse/AIRFLOW-249
> Project: Apache Airflow
> Issue Type: Improvement
> Reporter: dud
> Priority: Major
>
> Hello
> I've noticed the SLA feature is currently behaving as follow :
> - it doesn't work on DAG scheduled @once or None because they have no
> dag.followwing_schedule property
> - it keeps endlessly checking for SLA misses without ever worrying about any
> end_date. Worse I noticed that emails are still being sent for runs that are
> never happening because of end_date
> - it keeps checking for recent TIs even if SLA notification has been already
> been sent for them
> - the SLA logic is only being fired after following_schedule + sla has
> elapsed, in other words one has to wait for the next TI before having a
> chance of getting any email. Also the email reports dag.following_schedule
> time (I guess because it is close of TI.start_date), but unfortunately that
> doesn't match what the task instances shows nor the log filename
> - the SLA logic is based on max(TI.execution_date) for the starting point of
> its checks, that means that for a DAG whose SLA is longer than its schedule
> period if half of the TIs are running longer than expected it will go
> unnoticed. This could be demonstrated with a DAG like this one :
> {code}
> from airflow import DAG
> from airflow.operators import *
> from datetime import datetime, timedelta
> from time import sleep
> default_args = {
> 'owner': 'airflow',
> 'depends_on_past': False,
> 'start_date': datetime(2016, 6, 16, 12, 20),
> 'email': my_email
> 'sla': timedelta(minutes=2),
> }
> dag = DAG('unnoticed_sla', default_args=default_args,
> schedule_interval=timedelta(minutes=1))
> def alternating_sleep(**kwargs):
> minute = kwargs['execution_date'].strftime("%M")
> is_odd = int(minute) % 2
> if is_odd:
> sleep(300)
> else:
> sleep(10)
> return True
> PythonOperator(
> task_id='sla_miss',
> python_callable=alternating_sleep,
> provide_context=True,
> dag=dag)
> {code}
> I've tried to rework the SLA triggering mechanism by addressing the above
> points., please [have a look on
> it|https://github.com/dud225/incubator-airflow/commit/972260354075683a8d55a1c960d839c37e629e7d]
> I made some tests with this patch :
> - the fluctuent DAG shown above no longer make Airflow skip any SLA event :
> {code}
> task_id | dag_id | execution_date | email_sent |
> timestamp | description | notification_sent
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
> sla_miss | dag_sla_miss1 | 2016-06-16 15:05:00 | t | 2016-06-16
> 15:08:26.058631 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:07:00 | t | 2016-06-16
> 15:10:06.093253 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:09:00 | t | 2016-06-16
> 15:12:06.241773 | | t
> {code}
> - on a normal DAG, the SLA is being triggred more quickly :
> {code}
> // start_date = 2016-06-16 15:55:00
> // end_date = 2016-06-16 16:00:00
> // schedule_interval = timedelta(minutes=1)
> // sla = timedelta(minutes=2)
> task_id | dag_id | execution_date | email_sent |
> timestamp | description | notification_sent
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
> sla_miss | dag_sla_miss1 | 2016-06-16 15:55:00 | t | 2016-06-16
> 15:58:11.832299 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:56:00 | t | 2016-06-16
> 15:59:09.663778 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:57:00 | t | 2016-06-16
> 16:00:13.651422 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:58:00 | t | 2016-06-16
> 16:01:08.576399 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:59:00 | t | 2016-06-16
> 16:02:08.523486 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 16:00:00 | t | 2016-06-16
> 16:03:08.538593 | | t
> (6 rows)
> {code}
> than before (current master branch) :
> {code}
> // start_date = 2016-06-16 15:40:00
> // end_date = 2016-06-16 15:45:00
> // schedule_interval = timedelta(minutes=1)
> // sla = timedelta(minutes=2)
> task_id | dag_id | execution_date | email_sent |
> timestamp | description | notification_sent
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
> sla_miss | dag_sla_miss1 | 2016-06-16 15:41:00 | t | 2016-06-16
> 15:44:30.305287 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:42:00 | t | 2016-06-16
> 15:45:35.372118 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:43:00 | t | 2016-06-16
> 15:46:30.415744 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:44:00 | t | 2016-06-16
> 15:47:30.507345 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:45:00 | t | 2016-06-16
> 15:48:30.487742 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:46:00 | t | 2016-06-16
> 15:50:40.647373 | | t
> sla_miss | dag_sla_miss1 | 2016-06-16 15:47:00 | t | 2016-06-16
> 15:50:40.647373 | | t
> {code}
> Please note that in this last case (current master) execution_date is equal
> to dag.following_schedule, so SLA is being fired after one extra
> schedule_interval. Also note that SLA are still being triggered after
> end_date. Also note the timestamp column being updated seveal time.
> Please tell me what do you think about my patch.
> dud
--
This message was sent by Atlassian Jira
(v8.3.2#803003)