[GitHub] [incubator-airflow] afernandez commented on issue #3547: [AIRFLOW-2659] Improve Robustness of Operators in Airflow during Infra Outages

GitHub Mon, 17 Sep 2018 17:10:15 -0700

@Fokko My apologies for replying 2 months later (I was working on other 
high-priority projects). 
Good question. The primary reason is that retries in Airflow are mainly meant 
to handle transient errors, where 3-5 retries (or perhaps a 5-minute window) 
suffice. This PR tries to address a larger infrastructure outage that can last 
several hours.


A user may have a legitimate case for retrying only 3 times (say, a particular 
service is flaky under really high load). Keeping the retry count low for 
transient errors provides enough robustness against flaky services, but not so 
much that it completely masks unreliable ones.

The solution I'm proposing tries to be more intelligent by applying business 
logic specific to each hook.
If the failure is indeed a transient error, then retry according to the 
existing Airflow logic; but if it's a complete infrastructure outage, then 
perhaps keep retrying for 2-4 hours. Luckily, services like Hive, Presto, 
Spark, etc., can provide enough context to make this determination.
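
To make the idea concrete, here is a rough sketch of how a hook might pick a 
retry budget based on the kind of failure. Everything in it is hypothetical 
and for illustration only: the exception classes, `RetryPolicy`, 
`choose_policy`, and `should_retry` are not existing Airflow APIs, and the 
delay values are assumptions. Only the 5-minute and 4-hour windows come from 
the figures mentioned above.

```python
from dataclasses import dataclass
from datetime import timedelta


class TransientServiceError(Exception):
    """Hypothetical: the service responded but failed (e.g. overloaded)."""


class InfrastructureOutageError(Exception):
    """Hypothetical: the service is unreachable altogether."""


@dataclass(frozen=True)
class RetryPolicy:
    max_window: timedelta  # total time we are willing to keep retrying
    delay: timedelta       # pause between attempts (assumed values below)


# Short budget for flaky services, long budget for full outages.
TRANSIENT_POLICY = RetryPolicy(max_window=timedelta(minutes=5),
                               delay=timedelta(minutes=1))
OUTAGE_POLICY = RetryPolicy(max_window=timedelta(hours=4),
                            delay=timedelta(minutes=10))


def choose_policy(error):
    """Pick a retry budget from the error type raised by the hook."""
    if isinstance(error, InfrastructureOutageError):
        return OUTAGE_POLICY
    # Anything else is treated as transient and keeps the short budget.
    return TRANSIENT_POLICY


def should_retry(error, elapsed):
    """Return True while the elapsed time is inside the chosen window."""
    return elapsed < choose_policy(error).max_window
```

In this sketch each hook would be responsible for translating what the 
service reports into one of the two error types, so a flaky service still 
exhausts its retries quickly while an outage gets the longer window.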

[ Full content available at: 
https://github.com/apache/incubator-airflow/pull/3547 ]
This message was relayed via gitbox.apache.org for [email protected]
