[jira] [Commented] (AIRFLOW-249) Refactor the SLA mechanism

ASF GitHub Bot (Jira) Wed, 11 Sep 2019 09:38:26 -0700


    [ 
https://issues.apache.org/jira/browse/AIRFLOW-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927793#comment-16927793
 ]


ASF GitHub Bot commented on AIRFLOW-249:
----------------------------------------

Eronarn commented on pull request #3584: [AIRFLOW-249] Refactor the SLA 
mechanism
URL: https://github.com/apache/airflow/pull/3584
 
 
   ### JIRA
   
   AIRFLOW-249 Refactor the SLA mechanism
   
   This is a fairly large patch that should and/or may also address:
   
   AIRFLOW-1360 SLA miss goes to all emails in DAG
   AIRFLOW-2236 Airflow SLA is triggered for all backfilled tasks
   AIRFLOW-557 SLA notification does not work for manually triggered DAGs
   AIRFLOW-133 SLAs don't seem to work with schedule_interval=None
   AIRFLOW-1013 airflow/jobs.py:manage_slas() exception for @once dag
   
   ### Description
   
   At Quantopian we use Airflow to produce artifacts based on the previous 
day's stock market data. These artifacts are required for us to trade on 
today's stock market. Therefore, I've been investing time in improving Airflow 
notifications (such as writing PagerDuty and Slack integrations). My attention 
has turned to Airflow's SLA system, which has some drawbacks for our use case:
   
   1) Defining SLAs can be awkward because they are relative to the execution 
date instead of the task start time. There's no way to alert if a task runs for 
"more than an hour", for any non-trivial DAG. Instead you can only express 
"more than an hour from execution date".  The financial data we use varies in 
when it arrives, and how long it takes to process (data volume changes 
frequently). We also run DAGs with mixed UTC and Eastern events, making Airflow 
SLA definitions depend on time of year.
   
   2) Execution timeouts can capture "more than an hour" but do not serve the 
same purpose as SLAs. We have tight timelines on long-duration tasks that make 
retries difficult, so we want to alert an operator while leaving the original 
task running, rather than failing and then alerting.
   
   3) The way that SLA miss callbacks are defined is not intuitive, as in 
contrast to other callbacks, they are defined on the DAG rather than on the 
task. This has lead to a lot of confusion in JIRA/the mailing list; many people 
think that SLAs are for DAG completion, when in reality it's a DAG-level 
attribute that handles batched task-level completions. Also, the call signature 
is poorly defined: for instance, two of the arguments are just strings produced 
from the other two arguments.
   
   4) SLA miss emails don't include any links back to the Airflow instance 
(important for us because we run the same DAGs in both staging/production) or 
the execution date they apply to. When opened, they be hard to read for even a 
moderately sized DAG because they include a flat list of task instances that 
are unsorted (neither alpha nor topo).
   
   5) SLA miss emails are sent to every email address associated with the DAG. 
This can lead to inadvertent paging of users associated with unrelated "forks" 
in the DAG from where the SLA miss failed.
   
   6) SLA emails are not callbacks, and can't be turned off (other than either 
removing the SLA or removing the email attribute on the task instance).
   
   This patch attempts to address the above issues by making some of the 
following changes:
   
   1) The `sla=` parameter is split into:
   `expected_start`: Timedelta after execution date, representing when this 
task must have started by.
   `expected_finish`: Timedelta after execution date, representing when this 
task must have finished by.
   `expected_duration`: Timedelta after task start, representing how long this 
task is expected to run, including all retries.
   
   These parameters are set on a task (or DAG-level default args), and a task 
can have any combination of them, though there is some basic validation logic 
to warn you if you try to set an illogical combination. The SlaMiss object 
stores the type of SLA miss as a new database field, which is a component of 
the primary key.
   
   There is logic to convert the existing `sla=` parameter to `expected_finish` 
(as well as a migration), since that's the closest parallel, so it should be 
relatively backwards compatible.
   
   2) SLA misses are no longer grouped for checks or callbacks. While there 
have always been independent per-task SlaMiss objects, there was a lot of logic 
to poll for all SlaMisses that occurred at the same time, and to batch them 
into a single email.
   
   3) As a consequence of 2), The `sla_miss_callback` is no longer set on the 
DAG level, which has been confusing. It now has a context-based signature to be 
consistent with other task callbacks. This change is not backwards compatible 
for anyone using custom SLA miss callbacks, but should be a fairly 
straightforward conversion.
   
   4) The SLA miss email is now the default SLA miss callback on tasks. 
Previously it was an additional non-overrideable feature.
   
   5) The SLA miss email has some improvements:
   - Only one SLA miss per email
   - SLA-miss-specific title
   - Includes a link to the task instance
   - Only includes potentially-blocked downstreams
   - Sends to a list of "interested subscribers", which is defined as all email 
addresses on tasks downstream of the task that missed its SLA.
   - Additional ASCII art to help distinguish emails.
   
   6) Move the SLA miss code largely out of the Scheduler code: some into 
models (DAGs manage their own SlaMiss objects), and some into SLA helper 
functions. 
   
   7) Overall, attempt to reduce the complexity and lack of documentation of 
the SLA miss logic, given the constraint that the new implementation is a 
larger feature and more lines of code. The previous implementation was stuffed 
into one overloaded function that is responsible for checking for SLA misses, 
creating database objects for them, filtering tasks, selecting emails, 
rendering, and sending. These are now broken into multiple functions, which 
attempt to be more single-purpose.
   
   ### Tests
   I have rewritten the existing tests to match this patch; I am open to adding 
any required additional tests. (The tests probably also want to be moved out of 
the scheduler suite.)
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)"
   
   ### Documentation
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
   
   (Sort of - I still need to rewrite the high-level, on-website SLA 
documentation if this gets approved.)
   
   ### Code Quality
   - [x] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
   
   (except for a part I did not touch, and the core alembic migration template 
generating flake8 errors...)
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Refactor the SLA mechanism
> --------------------------
>
>                 Key: AIRFLOW-249
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-249
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: dud
>            Priority: Major
>
> Hello
> I've noticed the SLA feature is currently behaving as follow :
> - it doesn't work on DAG scheduled @once or None because they have no 
> dag.followwing_schedule property
> - it keeps endlessly checking for SLA misses without ever worrying about any 
> end_date. Worse I noticed that emails are still being sent for runs that are 
> never happening because of end_date
> - it keeps checking for recent TIs even if SLA notification has been already 
> been sent for them
> - the SLA logic is only being fired after following_schedule + sla has 
> elapsed, in other words one has to wait for the next TI before having a 
> chance of getting any email. Also the email reports dag.following_schedule 
> time (I guess because it is close of TI.start_date), but unfortunately that 
> doesn't match what the task instances shows nor the log filename
> - the SLA logic is based on max(TI.execution_date) for the starting point of 
> its checks, that means that for a DAG whose SLA is longer than its schedule 
> period if half of the TIs are running longer than expected it will go 
> unnoticed. This could be demonstrated with a DAG like this one :
> {code}
> from airflow import DAG
> from airflow.operators import *
> from datetime import datetime, timedelta
> from time import sleep
> default_args = {
>     'owner': 'airflow',
>     'depends_on_past': False,
>     'start_date': datetime(2016, 6, 16, 12, 20),
>     'email': my_email
>     'sla': timedelta(minutes=2),
> }
> dag = DAG('unnoticed_sla', default_args=default_args, 
> schedule_interval=timedelta(minutes=1))
> def alternating_sleep(**kwargs):
>     minute = kwargs['execution_date'].strftime("%M")
>     is_odd = int(minute) % 2
>     if is_odd:
>         sleep(300)
>     else:
>         sleep(10)
>     return True
> PythonOperator(
>     task_id='sla_miss',
>     python_callable=alternating_sleep,
>     provide_context=True,
>     dag=dag)
> {code}
> I've tried to rework the SLA triggering mechanism by addressing the above 
> points., please [have a look on 
> it|https://github.com/dud225/incubator-airflow/commit/972260354075683a8d55a1c960d839c37e629e7d]
> I made some tests with this patch :
> - the fluctuent DAG shown above no longer make Airflow skip any SLA event :
> {code}
>  task_id  |    dag_id     |   execution_date    | email_sent |         
> timestamp          | description | notification_sent 
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:05:00 | t          | 2016-06-16 
> 15:08:26.058631 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:07:00 | t          | 2016-06-16 
> 15:10:06.093253 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:09:00 | t          | 2016-06-16 
> 15:12:06.241773 |             | t
> {code}
> - on a normal DAG, the SLA is being triggred more quickly :
> {code}
> // start_date = 2016-06-16 15:55:00
> // end_date = 2016-06-16 16:00:00
> // schedule_interval =  timedelta(minutes=1)
> // sla = timedelta(minutes=2)
>  task_id  |    dag_id     |   execution_date    | email_sent |         
> timestamp          | description | notification_sent 
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:55:00 | t          | 2016-06-16 
> 15:58:11.832299 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:56:00 | t          | 2016-06-16 
> 15:59:09.663778 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:57:00 | t          | 2016-06-16 
> 16:00:13.651422 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:58:00 | t          | 2016-06-16 
> 16:01:08.576399 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:59:00 | t          | 2016-06-16 
> 16:02:08.523486 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 16:00:00 | t          | 2016-06-16 
> 16:03:08.538593 |             | t
> (6 rows)
> {code}
> than before (current master branch) :
> {code}
> // start_date = 2016-06-16 15:40:00
> // end_date = 2016-06-16 15:45:00
> // schedule_interval =  timedelta(minutes=1)
> // sla = timedelta(minutes=2)
>  task_id  |    dag_id     |   execution_date    | email_sent |         
> timestamp          | description | notification_sent 
> ----------+---------------+---------------------+------------+----------------------------+-------------+-------------------
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:41:00 | t          | 2016-06-16 
> 15:44:30.305287 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:42:00 | t          | 2016-06-16 
> 15:45:35.372118 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:43:00 | t          | 2016-06-16 
> 15:46:30.415744 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:44:00 | t          | 2016-06-16 
> 15:47:30.507345 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:45:00 | t          | 2016-06-16 
> 15:48:30.487742 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:46:00 | t          | 2016-06-16 
> 15:50:40.647373 |             | t
>  sla_miss | dag_sla_miss1 | 2016-06-16 15:47:00 | t          | 2016-06-16 
> 15:50:40.647373 |             | t
> {code}
> Please note that in this last case (current master) execution_date is equal 
> to dag.following_schedule, so SLA is being fired after one extra 
> schedule_interval. Also note that SLA are still being triggered after 
> end_date. Also note the timestamp column being updated seveal time.
> Please tell me what do you think about my patch.
> dud



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Commented] (AIRFLOW-249) Refactor the SLA mechanism

Reply via email to