Re: [DISCUSS] When a DAG is paused, change the dag run state from running to failed.

Brent Bovenzi Thu, 03 Apr 2025 11:28:47 -0700

The issue is that duration is computed from the start and end dates. If there
is no end date, we usually default to now. But that is misleading when a dag
run is still running while the dag is paused.
Let me take a look at where we use duration in the 3.0 UI and see if we can
reduce that confusion. We don't have the "5 longest dag runs" in our new
dashboard page, which replaces cluster activity. If we wanted that feature
again, we should be mindful of this and filter out paused dags in the API
request.
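
To make the point concrete, here is a minimal sketch of how a missing end
date inflates the duration, and how filtering out paused dags would change a
"longest dag runs" list. This is not Airflow's code: DagRunInfo, run_duration
and longest_runs are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class DagRunInfo:
    # Hypothetical, simplified stand-in for a dag run record (not Airflow's model).
    dag_id: str
    start_date: datetime
    end_date: Optional[datetime]
    dag_is_paused: bool


def run_duration(run: DagRunInfo, now: datetime) -> timedelta:
    # With no end date we fall back to "now", so an unfinished run keeps growing.
    end = run.end_date if run.end_date is not None else now
    return end - run.start_date


def longest_runs(
    runs: List[DagRunInfo],
    now: datetime,
    limit: int = 5,
    include_paused: bool = False,
) -> List[DagRunInfo]:
    # Drop runs of paused dags (unless asked not to) before ranking by duration.
    candidates = [r for r in runs if include_paused or not r.dag_is_paused]
    return sorted(candidates, key=lambda r: run_duration(r, now), reverse=True)[:limit]


if __name__ == "__main__":
    now = datetime(2025, 4, 3, 12, 0, tzinfo=timezone.utc)
    runs = [
        DagRunInfo("etl", now - timedelta(hours=2), now - timedelta(hours=1), False),
        DagRunInfo("paused_report", now - timedelta(days=3), None, True),
        DagRunInfo("live", now - timedelta(minutes=30), None, False),
    ]
    print([r.dag_id for r in longest_runs(runs, now, include_paused=True)])
    print([r.dag_id for r in longest_runs(runs, now)])
```

With include_paused=True the run of the paused dag tops the list only because
its end date is missing; with the default filter it is left out of the ranking.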




On Thu, Apr 3, 2025, 1:27 PM Pedro Nunes Leal
<[email protected]> wrote:

> On 2025-03-31 22:26, Jens Scheffler wrote:
> > Hi,
> >
> > thanks for working on the bug and raising a PR to fix it.
> >
> > As other committers also commented, from a product view I'd expect a
> > different resolution. We use "Pause DAG" in most cases for
> > administrative or infrastructure problems, to prevent further failures
> > and/or to drain infra in order to switch some backend.
> >
> > I assume that when we pause a long-running DAG that is in between
> > task executions, we really want to "pause" scheduling; we don't want to
> > set it to failed. That would also not be correct, because once we un-pause,
> > the running DAGs should continue to work. I see no reason to mark the run
> > failed and then manually chase after it to reset the state later.
> >
> > My view is that, as also proposed in the discussion of the bug,
> > we should rather filter paused DAGs out of cluster activity reporting
> > so that paused DAGs are not reported with excessive runtime. Also,
> > if the DAG is later un-paused, it would be "right" that the overall DAG
> > runtime was longer than normal (I would not expect the paused time to be
> > deducted from the runtime of the DAG).
> >
> > If I (as operator/admin) really want to terminate existing running
> > instances, I'd rather walk through Browse -> DAG Runs -> Filter for
> > running with the paused DAG id and mark them as failed explicitly.
> >
> > Jens
> >
> > On 31.03.25 20:50, Pedro Nunes Leal wrote:
> >> Hello everyone,
> >>
> >> Currently, I'm trying to fix this bug:
> >> https://github.com/apache/airflow/issues/44443
> >>
> >> Basically, the issue is that the DAGs would be stuck on running even
> >> though they were paused.
> >> Consequently, the duration of the dag run will keep on increasing even
> >> though the DAG is paused.
> >>
> >> My proposal to solve this problem is to change the DAG run state from
> >> running to failed when the DAG is paused, so that its duration stops
> >> increasing.
> >>
> >> Since this can be an impactful change, I would like to hear what
> >> others think about it.
> >>
> >> Link for the Pull Request:
> >> https://github.com/apache/airflow/pull/47557
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> That can be a better approach.
>
> However, if I'm not mistaken, the code related to the cluster activity
> page doesn't exist in Airflow 3 (the version where I'm trying to do the
> changes).
>
> So what should I do in this case?
> Is there any other way not involving cluster activity to solve this
> problem?
>
> Changing to the queued state instead of failed was my proposal at the
> beginning, and it really pauses the DAG.
> This is the type of solution I was thinking of because, as I said before in
> the pull request, I feel that the cluster activity behavior is just a
> symptom of a bigger problem (the DAGs don't really pause, they just
> keep running).
>
