[ https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520768#comment-16520768 ]
Alejandro Fernandez commented on AIRFLOW-2659:
----------------------------------------------
I came across https://issues.apache.org/jira/browse/AIRFLOW-1620, which was
created by [~aoen] and is essentially the same request as this Jira.
> Improve Robustness of Operators in Airflow during Infra Outages
> ---------------------------------------------------------------
>
> Key: AIRFLOW-2659
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
> Project: Apache Airflow
> Issue Type: Improvement
> Components: hooks
> Affects Versions: 1.10
> Reporter: Alejandro Fernandez
> Assignee: Alejandro Fernandez
> Priority: Major
> Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on
> those services will fail in Airflow, thereby causing SLA misses and
> deteriorating user confidence in Airflow (even if the outage was in another
> system). Only a fraction of tasks and DAGs have retries around certain
> operators/hooks, and those retry attempts are not sufficient during an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry
> window, etc.)
> * Allow each Hook to determine the root cause of the error (user error or
> infra outage)
> ** User errors will be handled the way they are today.
> ** Infra errors will be able to retry for extended periods of time.
> * Configurable (optional, configurable per Hook)
> * Emit metrics using StatsD
> See attached [^AIRFLOW_2659.pdf] design doc.
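
The high-level design quoted above could look roughly like the following. This is only a minimal sketch and not the implementation from the attached design doc: the names `InfraError`, `retry_on_infra_error`, `is_infra_error`, and `on_retry` are hypothetical, as are all parameter names and defaults. The `on_retry` callback stands in for the place where a StatsD counter might be incremented, and `sleep`/`clock` are injectable only to make the sketch easy to test.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=3600.0,
                         is_infra_error=lambda exc: isinstance(exc, InfraError),
                         on_retry=None, sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call with exponential backoff while errors look infra-related.

    All names and defaults here are illustrative assumptions, not Airflow's API.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window  # time-based limit, not an attempt count
            delay = initial_delay
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # User errors propagate immediately, as they do today.
                    if not is_infra_error(exc):
                        raise
                    # Give up once the next sleep would pass the retry window.
                    if clock() + delay > deadline:
                        raise
                    attempt += 1
                    if on_retry is not None:
                        # Hook for metrics, e.g. incrementing a StatsD counter.
                        on_retry(func.__name__, attempt, delay)
                    sleep(delay)
                    # Double the delay, capped at max_delay.
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

A Hook method would then be annotated with `@retry_on_infra_error(...)`, and each Hook could supply its own `is_infra_error` predicate to separate user errors from outages.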
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)