[
https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520048#comment-16520048
]
Alejandro Fernandez edited comment on AIRFLOW-2659 at 6/22/18 8:43 PM:
-----------------------------------------------------------------------
Hi [[email protected]], [~bolke], [~saguziel], [~yrqls21], [~artwr]
I'm proposing this feature and would greatly appreciate feedback on the design
doc. Happy to also share it on the dev mailing list.
Cheers,
Alejandro
> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
> Key: AIRFLOW-2659
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
> Project: Apache Airflow
> Issue Type: Improvement
> Components: hooks
> Affects Versions: 1.10
> Reporter: Alejandro Fernandez
> Assignee: Alejandro Fernandez
> Priority: Major
> Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
> If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on
> those services fail in Airflow, causing SLA misses and eroding user
> confidence in Airflow (even though the outage was in another system). Only a
> fraction of tasks and DAGs have retries around certain operators/hooks, and
> those retry attempts are not sufficient to outlast an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
> * Retry decorator in the Hooks for easy annotation
> * Retry logic will be time-based (initial delay, max delay time, retry
> window, etc.)
> * Allow each Hook to determine the root cause of the error (user error or
> infra outage)
> ** User errors will be handled the way they are today.
> ** Infra errors will be able to retry for extended periods of time.
> * Optional and configurable per Hook
> * Emit metrics using StatsD
> See the attached design doc [^AIRFLOW_2659.pdf].
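
To make the high-level design above concrete, here is a minimal sketch of a time-based retry decorator. It is not the implementation from the design doc and not an existing Airflow API: the names `retry_on_infra_error`, `InfraError`, `UserError`, `classify`, and `on_retry` are all hypothetical, the exponential backoff is an assumed policy, and the `on_retry` callback only stands in for wherever a StatsD counter would be emitted.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures caused by an infrastructure outage."""


class UserError(Exception):
    """Hypothetical marker for failures caused by the user (bad query, etc.)."""


def retry_on_infra_error(initial_delay=1.0, max_delay=60.0, retry_window=3600.0,
                         classify=None, on_retry=None,
                         sleep=time.sleep, clock=time.monotonic):
    """Retry the wrapped call while it fails with an infra error.

    All parameter names are assumptions made for this sketch:
      initial_delay -- seconds to wait before the first retry
      max_delay     -- upper bound on the wait between retries
      retry_window  -- total seconds during which retries are allowed
      classify      -- optional callable(exc) -> bool, True for infra errors
      on_retry      -- optional callable(name, attempt, delay), e.g. a metrics hook
      sleep, clock  -- injectable so the logic can be tested without waiting
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window
            delay = initial_delay
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if classify is not None:
                        is_infra = classify(exc)
                    else:
                        is_infra = isinstance(exc, InfraError)
                    # User errors propagate immediately, as they do today.
                    # Infra errors propagate once the next wait would end
                    # after the retry window has closed.
                    if not is_infra or clock() + delay > deadline:
                        raise
                    attempt += 1
                    if on_retry is not None:
                        on_retry(func.__name__, attempt, delay)
                    sleep(delay)
                    # Assumed policy: double the delay, capped at max_delay.
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator
```

Under these assumptions, a Hook method would be annotated with something like `@retry_on_infra_error(initial_delay=30, max_delay=600, retry_window=4 * 3600)`, and each Hook would supply its own `classify` function to separate infra outages from user errors.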
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)