[ 
https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927767#comment-16927767
 ] 

ASF GitHub Bot commented on AIRFLOW-2659:
-----------------------------------------

stale[bot] commented on pull request #3547: [AIRFLOW-2659] Improve Robustness 
of Operators in Airflow during Infra Outages
URL: https://github.com/apache/airflow/pull/3547
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Improve Robustness of Operators in Airflow during Infra Outages
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
>  If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on 
> its services fail in Airflow, causing SLA misses and eroding user 
> confidence in Airflow (even though the outage was in another system). Only a 
> fraction of tasks and DAGs have retries around certain operators/hooks, and 
> those retry attempts are not sufficient during an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
>  * Retry decorator in the Hooks for easy annotation
>  * Retry logic will be time-based (initial delay, max delay time, retry 
> window, etc.)
>  * Allow each Hook to determine the root cause of the error (user error or 
> infra outage)
>  ** User errors will be handled the way they are today.
>  ** Infra errors will be able to retry for extended periods of time.
>  * Optional and configurable per Hook
>  * Emit metrics using StatsD
> See the attached design doc [^AIRFLOW_2659.pdf].
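
The high-level design above could be sketched roughly as follows. This is only an illustration of a time-based retry decorator, not the code from the pull request or the design doc: the names `InfraError`, `retry_on_infra_error`, `is_infra_error`, and all default values are hypothetical, and exponential backoff is an assumed policy for moving from the initial delay to the max delay. StatsD metrics are omitted.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=600.0,
                         is_infra_error=lambda exc: isinstance(exc, InfraError),
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call on infra errors until the retry window closes.

    All parameter names and defaults are assumptions for illustration.
    `sleep` and `clock` are injectable so the behavior can be tested
    without actually waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately, as they do today.
                    if not is_infra_error(exc):
                        raise
                    # Give up if the next sleep would end past the window.
                    if clock() + delay > deadline:
                        raise
                    sleep(delay)
                    # Assumed policy: double the delay, capped at max_delay.
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A Hook method could then be annotated with, for example, `@retry_on_infra_error(retry_window=3600)`, passing its own `is_infra_error` callable to decide whether a given exception stems from an outage or from user error.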



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
