[
https://issues.apache.org/jira/browse/AIRFLOW-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677006#comment-16677006
]
Ash Berlin-Taylor commented on AIRFLOW-3285:
--------------------------------------------
The lazy feature as you have described it isn't something we'd accept, as it's
quite a behaviour change and a little bit of a workaround, but a combo
trigger rule, so that we could say e.g. {{trigger_rule=\{'all_done','one_failed',\}}}
to mean "trigger on any of these conditions", would be acceptable.
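The set-valued {{trigger_rule}} suggested above reads naturally as OR-ing the individual rules. A minimal sketch of that evaluation, assuming illustrative state names and rule functions (this is hypothetical, not current Airflow API):

```python
# Hypothetical sketch of a combined trigger_rule with OR semantics.
# State names and rule functions are illustrative, not Airflow internals.

TERMINAL = {"success", "failed", "upstream_failed", "skipped"}

def rule_all_done(upstream_states):
    # Satisfied once every upstream task has reached a terminal state.
    return all(s in TERMINAL for s in upstream_states)

def rule_one_failed(upstream_states):
    # Satisfied as soon as any upstream task has failed.
    return any(s == "failed" for s in upstream_states)

RULES = {"all_done": rule_all_done, "one_failed": rule_one_failed}

def should_trigger(trigger_rule, upstream_states):
    # trigger_rule may be a single rule name or a set of names;
    # with a set, the task triggers when ANY member rule is satisfied.
    names = {trigger_rule} if isinstance(trigger_rule, str) else trigger_rule
    return any(RULES[name](upstream_states) for name in names)
```

With this reading, a cluster-delete task using {{\{'all_done','one_failed'\}}} would fire either when everything upstream has finished or as soon as the first failure appears, whichever comes first.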
> lazy marking of upstream_failed task state
> ------------------------------------------
>
> Key: AIRFLOW-3285
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3285
> Project: Apache Airflow
> Issue Type: Improvement
> Reporter: Kevin McHale
> Priority: Minor
>
> Airflow aggressively applies the {{upstream_failed}} task state: as soon as a
> task fails, all of its downstream dependencies get marked. This sometimes
> creates problems for us at Etsy.
> In particular, we use a pattern for our Hadoop Airflow DAGs along these lines:
> # the DAG creates a hadoop cluster in GCP/Dataproc
> # the DAG executes its tasks on the cluster
> # the DAG deletes the cluster once all tasks are done
> There are some cases in which the tasks immediately upstream of the
> cluster-delete step get marked as {{upstream_failed}}, triggering the
> cluster-delete step, even while other tasks continue to execute without
> problems on the cluster. The cluster-delete step of course kills all of the
> running tasks, requiring all of them to be re-run once the problem with the
> failed task is mitigated.
> As an example, a DAG that looks like this can exhibit the problem:
> {code:python}
> Cluster = ClusterCreateOperator(...)
> A = Job1Operator(...)
> Cluster >> A
> B = Job2Operator(...)
> Cluster >> B
> C = Job3Operator(...)
> A >> C
> B >> C
> ClusterDelete = DeleteClusterOperator(trigger_rule="all_done", ...)
> C >> ClusterDelete{code}
> In a DAG like this, suppose task A fails while task B is running. Task C
> will immediately be marked {{upstream_failed}}, which causes ClusterDelete
> to run while task B is still running, which in turn causes task B to fail
> as well.
> Our solution to this problem has been to implement something like [this
> diff|https://github.com/mchalek/incubator-airflow/commit/585349018656cd9b2e3e3e113db6412345485dde],
> which lazily applies the {{upstream_failed}} state only to tasks for which
> all upstream tasks have already completed.
> The consequence in terms of the example above is that task C will not be
> marked {{upstream_failed}} in response to task A failing until task B
> completes, ensuring that the cluster is not deleted while any upstream tasks
> are running.
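The lazy-marking rule in the linked diff can be paraphrased as: only mark a task {{upstream_failed}} once all of its upstream tasks are in a terminal state. A small sketch comparing the two policies, assuming simplified state names (this is illustrative, not the actual scheduler code from the diff):

```python
# Illustrative comparison of eager vs. lazy upstream_failed marking.
# State names are simplified; this is not Airflow scheduler code.

TERMINAL = {"success", "failed", "upstream_failed", "skipped"}
FAILED = {"failed", "upstream_failed"}

def eager_mark(upstream_states):
    # Current behavior: mark as soon as any upstream task has failed.
    return any(s in FAILED for s in upstream_states)

def lazy_mark(upstream_states):
    # Proposed behavior: wait until every upstream task is terminal,
    # then mark only if at least one of them failed.
    return (all(s in TERMINAL for s in upstream_states)
            and any(s in FAILED for s in upstream_states))
```

In the example DAG, while A has failed and B is still running, C's upstream states are {{['failed', 'running']}}: eager marking flags C immediately (triggering the cluster delete under B), while lazy marking waits for B to finish first.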
> We have found this to cause no adverse behavior on our Airflow instances, so
> we run all of them with this lazy-marking feature enabled. However, we
> recognize that a change in behavior like this may be something that existing
> users will want to opt in to, so we included a config flag in the diff that
> defaults to the original behavior.
> We would appreciate your consideration of incorporating this diff, or
> something like it, to allow us to configure this behavior in unmodified,
> upstream Airflow.
> Thanks!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)