[ 
https://issues.apache.org/jira/browse/AIRFLOW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520632#comment-16520632
 ] 

Alejandro Fernandez commented on AIRFLOW-2659:
----------------------------------------------

Hi [~ashb], good question. I'll try my best to elaborate.
 !retry_window.png! 

The existing task-level retry settings work well as a catch-all for shorter 
time windows, since DAG owners do set these attributes, e.g.,
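{code}
from datetime import timedelta

default_args = {
  'retries': 3,                                # retry attempts after the first failure
  'retry_delay': timedelta(minutes=5),         # wait before the first retry
  'retry_exponential_backoff': True,           # lengthen the wait on each retry
  'max_retry_delay': timedelta(minutes=15),    # upper bound on any single wait
}
{code}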
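
To make the "shorter time windows" point concrete, here is a rough sketch of how much waiting time those settings buy. It assumes a plain doubling backoff capped at max_retry_delay; Airflow's actual computation differs in detail, and approx_retry_delays is a hypothetical helper for illustration, not an Airflow function.

```python
from datetime import timedelta


def approx_retry_delays(retries, retry_delay, max_retry_delay):
    """Approximate wait before each retry: doubling backoff with a cap.

    Simplified assumption for illustration only, not Airflow's exact formula.
    """
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)          # 5 min, 10 min, 20 min, ...
        delays.append(min(delay, max_retry_delay))    # never wait longer than the cap
    return delays


delays = approx_retry_delays(
    retries=3,
    retry_delay=timedelta(minutes=5),
    max_retry_delay=timedelta(minutes=15),
)
total_wait = sum(delays, timedelta())  # total time spent waiting between attempts
print(delays, total_wait)
```

Under that assumption the task gives up after roughly half an hour of waiting (plus run time and queueing), so an outage that lasts longer exhausts the retries.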

In practice, owners have to balance the number of retries: high enough to be 
robust when a transient/infra error occurs, but not so high that it masks 
flaky code.

Further, retries happen at the task level, so a failed task is rescheduled by 
putting it at the end of the queue, where it has to wait for an open slot.

Also, the existing logic will retry regardless of the type of error. If we 
increased the retry values on the DAGs, we could end up masking flaky code 
that simply keeps retrying for longer, consuming more cluster resources 
(slots, hardware).

Instead, we want to have more granular control around retries such that infra 
outages will retry for an extended time interval without having to requeue the 
task or clear its downstream dependencies.
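
A minimal sketch of what such a retry could look like inside a hook, assuming the hook can classify an error as infra-related. InfraError, retry_within_window, and the parameter names are hypothetical and only illustrate the idea; they are not existing Airflow APIs. The sleep and clock arguments are injectable so the example runs without real waiting.

```python
import functools
import time


class InfraError(Exception):
    """Hypothetical marker for failures a hook attributes to an infra outage."""


def retry_within_window(initial_delay, max_delay, retry_window,
                        sleep=time.sleep, clock=time.monotonic):
    """Retry InfraError in place until retry_window seconds have elapsed.

    Any other exception (e.g. a user error) propagates immediately, so the
    task-level retries handle it the way they do today.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + retry_window
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except InfraError:
                    if clock() + delay > deadline:
                        raise                          # window exhausted: fail the task
                    sleep(delay)
                    delay = min(delay * 2, max_delay)  # back off, up to max_delay
        return wrapper
    return decorator


# Demo: the call fails three times with an infra error, then succeeds.
calls = {"n": 0}
waits = []  # records requested sleeps instead of actually sleeping


@retry_within_window(initial_delay=1, max_delay=4, retry_window=60,
                     sleep=waits.append)
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 4:
        raise InfraError("service unavailable")
    return "ok"


result = flaky_call()
print(result, waits)
```

Because the retry loop runs inside the hook call, the task keeps its slot and is neither requeued nor marked failed while the outage is within the window.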


> Improving Robustness of Operators in Airflow during Infra Outages
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2659
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2659
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: hooks
>    Affects Versions: 1.10
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Major
>         Attachments: AIRFLOW_2659.pdf, retry_window.png, test_rules.py
>
>
> *Problem:*
>  If an infrastructure outage occurs on the Hadoop cluster, tasks that rely on 
> those services will fail in Airflow, thereby causing SLA misses and 
> deteriorating user confidence in Airflow (even if the outage was in another 
> system). Only a fraction of tasks and DAGs have retries around certain 
> operators/hooks and the retry attempts are not sufficient during an outage.
> *Goal:* Automatically retry failures that occur due to infrastructure issues.
> *High-level design:*
>  * Retry decorator in the Hooks for easy annotation
>  * Retry logic will be time-based (initial delay, max delay time, retry 
> window, etc.)
>  * Allow each Hook to determine the root-cause of the error (user, infra 
> outage)
>  ** User-errors will be handled the way they are today.
>  ** Infra-errors will be able to retry for extended periods of time.
>  * Configurable (feature toggle, configurable per Hook)
>  * Emit metrics using StatsD
> See attached [^AIRFLOW_2659.pdf] design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
