[jira] [Work started] (AIRFLOW-6648) Timeout Feature - Provided statistical solution to long running/stuck jobs and take appropriate actions

Golokesh Patra (Jira) Sun, 07 Jun 2020 21:03:12 -0700


     [ 
https://issues.apache.org/jira/browse/AIRFLOW-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Work on AIRFLOW-6648 started by Golokesh Patra.
-----------------------------------------------
> Timeout Feature - Provided statistical solution to long running/stuck jobs 
> and take appropriate actions
> -------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-6648
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6648
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, DAG, database, operators
>    Affects Versions: 1.10.0
>         Environment: AWS Linux AMI -  Ubuntu 18.04.1 LTS (GNU/Linux 
> 4.15.0-1027-aws x86_64)
>            Reporter: Golokesh Patra
>            Assignee: Golokesh Patra
>            Priority: Minor
>         Attachments: image-2020-01-27-17-07-51-822.png, 
> image-2020-01-27-17-08-09-867.png, image-2020-01-27-17-08-33-088.png, 
> image-2020-01-27-17-22-07-433.png, image2019-3-25_12-33-57.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Sometimes, across different types of tasks/jobs,
>  Airflow jobs/tasks get stuck while they are in the running state.
>  Such issues leave the pipeline stuck for no apparent reason and stall other 
> jobs/tasks, which is a disaster when it happens in production.
> This improvement aims not only to improve upon the TIMEOUT logic 
> already in Airflow, but also to make it more functional and automated.
> *Diagrammatic Explanation of the solution -* 
> !image-2020-01-27-17-22-07-433.png!
> *Detailed Theoretical Explanation -* 
> As the data volume and complexity of tasks/jobs grow, along with the load, 
> memory leaks, stuck jobs, or infrastructure issues become more likely and 
> can produce unwanted results.
>  On some days there may be more data, causing a steep jump in the 
> duration of the job; otherwise, the growth is expected to be gradual.
>  And sometimes jobs get stuck because of various issues and often 
> require termination followed by a restart.
>  So, we are trying to build logic that will automatically decide whether to
>  * _Terminate the job_
>  * _Terminate and restart_
>  * _Terminate and mark as a failure so that downstream jobs don't get 
> triggered_
>  * _Take no action and inform DevOps about the issue (manual action)_
>  So, I would like to know what would be a statistically effective way to 
> achieve the above outcomes.
> Let's consider two jobs, X and Y.
> Job-related info -
>  !image-2020-01-27-17-07-51-822.png!
> !image-2020-01-27-17-08-09-867.png!
> I was then thinking of adding a new table, structured as follows -
> +Derived table-+ 
>  !image-2020-01-27-17-08-33-088.png!
> (The above example is theoretical; the actual implementation might differ.)
> *LIMITATIONS -* 
>  # For now, we have only tested the above on EMR (personal use case).
>  # Testing is pending for Databricks (personal use case).
> Please suggest any other services where this is needed or could be used.
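
The decision described in the issue (leave the job alone, inform DevOps, terminate and restart, or terminate and mark as failed) could be sketched roughly as below. This is only an illustration, not the implementation proposed in the issue: the mean-plus-k-standard-deviations rule, the function name decide_action, the Action enum, and all thresholds and parameters are assumptions chosen for the example, and nothing here uses Airflow's own API.

```python
from enum import Enum
from statistics import mean, stdev
from typing import Sequence


class Action(Enum):
    """Hypothetical outcomes, loosely mirroring the list in the issue."""
    NONE = "no action"
    NOTIFY = "take no action, inform DevOps"
    RESTART = "terminate and restart"
    FAIL = "terminate and mark as failed"


def decide_action(
    history: Sequence[float],   # past run durations of this job, in seconds
    elapsed: float,             # how long the current run has been running
    retries_left: int,          # restarts still allowed for this run
    warn_sigmas: float = 2.0,   # assumed threshold: notify beyond mean + 2 sigma
    kill_sigmas: float = 4.0,   # assumed threshold: terminate beyond mean + 4 sigma
    min_samples: int = 5,       # assumed minimum history needed to trust the stats
) -> Action:
    """Pick an action by comparing elapsed time with historical durations."""
    # With too little history the statistics are unreliable, so do nothing.
    if len(history) < min_samples:
        return Action.NONE

    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    warn_at = mu + warn_sigmas * sigma
    kill_at = mu + kill_sigmas * sigma

    if elapsed <= warn_at:
        return Action.NONE
    if elapsed <= kill_at:
        # Slower than usual, but possibly just a heavier data day.
        return Action.NOTIFY
    # Far outside the usual range: treat the run as stuck.
    return Action.RESTART if retries_left > 0 else Action.FAIL


if __name__ == "__main__":
    past_runs = [60.0, 62.0, 58.0, 61.0, 59.0]
    print(decide_action(past_runs, elapsed=65.0, retries_left=1))
```

A fixed multiple of the standard deviation is only one possible rule. Because the issue expects durations to grow gradually, a rolling window over recent runs or a percentile-based threshold may suit real workloads better.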



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
