Alejandro Fernandez created AIRFLOW-2659:
--------------------------------------------

             Summary: Improving Robustness of Operators in Airflow during Infra Outages
                 Key: AIRFLOW-2659
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
             Project: Apache Airflow
          Issue Type: Improvement
          Components: hooks
    Affects Versions: 1.10
            Reporter: Alejandro Fernandez
            Assignee: Alejandro Fernandez


*Problem:*
If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on 
the affected services fail in Airflow, causing SLA misses and eroding user 
confidence in Airflow (even though the outage was in another system). Only a 
fraction of tasks and DAGs have retries around the relevant operators/hooks, 
and the retry attempts they do have are not sufficient to outlast an outage.

*Goal:* Automatically retry failures that occur due to infrastructure issues.

*High-level design:*
* Retry decorator in the Hooks for easy annotation
* Retry logic will be time-based (initial delay, max delay time, retry window, 
etc.)
* Allow each Hook to determine the root cause of the error (user, infra outage)
** User errors will be handled the way they are today.
** Infra errors will be able to retry for extended periods of time.
* Configurable (feature toggle, configurable per Hook)
* Emit metrics using StatsD
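
The design points above could be sketched roughly as follows. This is only an illustration, not the implementation from the design doc: the names retry_on_infra_error, InfraError, UserError, classify and emit_metric are all hypothetical, the backoff is a simple doubling capped at max_delay, and the metric callback stands in for whatever StatsD client a Hook would actually use.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


class UserError(Exception):
    """Hypothetical marker for failures caused by the user (bad query, etc.)."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=3600.0,
                         enabled=True, classify=None, emit_metric=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call for as long as failures look like infra errors.

    classify(exc) returns True for infra errors (assumed per-Hook logic).
    emit_metric(name) is a placeholder for incrementing a StatsD counter.
    """
    if classify is None:
        classify = lambda exc: isinstance(exc, InfraError)
    if emit_metric is None:
        emit_metric = lambda name: None

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window  # time-based, not count-based
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Feature toggle off, or a user error: fail as today.
                    if not enabled or not classify(exc):
                        raise
                    remaining = deadline - clock()
                    if remaining <= 0:
                        emit_metric("infra_retry.exhausted")
                        raise
                    emit_metric("infra_retry.attempt")
                    sleep(min(delay, remaining))
                    delay = min(delay * 2, max_delay)  # capped doubling
        return wrapper
    return decorator
```

A Hook method would then be annotated with something like @retry_on_infra_error(retry_window=4 * 3600), with classify supplied per Hook to map its own exceptions to user or infra errors.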

See attached design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
