Re: [I] Add ability to disable a running DAG only after after it's in a finished state [airflow]

via GitHub Tue, 26 Nov 2024 07:46:23 -0800


potiuk commented on issue #22006:
URL: https://github.com/apache/airflow/issues/22006#issuecomment-2501021030


   > This is not true unless some strong assumptions that you are making:
   
   Well. Not really. Airflow already has the ability to run different DagRuns 
at the same time, so by default there is absolutely no way that one DagRun of 
a DAG should impact another DagRun of the same DAG. They can be run 
sequentially or in parallel, and other than explicitly setting 
`max_active_runs=1` and a few other task parameters, you have zero control over 
whether they are executed in parallel or not. Operators are idempotent (and 
yes, I did not confuse that with independent) because this is how Airflow was 
designed initially. From 
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/index.html:
   
   > An operator represents a single, ideally idempotent, task. Operators 
determine what actually executes when your DAG runs.
   
   And yes - while some operators are not idempotent, lack of idempotency 
breaks much of the functionality of Airflow that it was designed for (Re-runs, 
clearing, backfills and so on). And pretty much breaks "various DAG runs for 
the same DAG can be run in parallel". 
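   
   As a minimal sketch of why idempotency matters for re-runs and clearing 
(plain Python, not Airflow code; `append_load` and `upsert_load` are made-up 
names, and a list and a dict stand in for a real target table):

```python
# Toy illustration only: a list and a dict stand in for a real target table.

def append_load(table: list, logical_date: str, value: int) -> None:
    """Non-idempotent: every run (or re-run) adds another row."""
    table.append((logical_date, value))


def upsert_load(table: dict, logical_date: str, value: int) -> None:
    """Idempotent: the row is keyed by the run's logical date,
    so a re-run overwrites the same row instead of adding one."""
    table[logical_date] = value


appended: list = []
upserted: dict = {}

# Simulate the same task running twice for the same logical date,
# for example after the task was cleared and re-run.
for _ in range(2):
    append_load(appended, "2024-11-26", 42)
    upsert_load(upserted, "2024-11-26", 42)

print(len(appended))  # 2 rows: the re-run duplicated the data
print(len(upserted))  # 1 row: the re-run left the result unchanged
```

   With the append-style task, a re-run or backfill changes the result; with 
the keyed overwrite, it does not.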
   
   > Remember that Airflow is a very flexible tool that is agnostic about what 
kind of tasks it runs and can be used for a huge variety of workflows. There is 
absolutely no reason to make such strong assumptions as tasks are idempotent, 
independent, backwards compatible when they are updated, etc.
   
   Of course it is not needed in a number of cases. And I absolutely did not 
say anything about being backwards compatible (this is your addition). With 
the proposed architecture of Airflow 3 and DAG versioning, the assumption of 
backwards compatibility is precisely what is going away. Currently you 
absolutely need DAG backwards compatibility in Airflow when you evolve a DAG 
(and this is why you need to pause it to upgrade the DAG in a non-compatible 
way) - and this is precisely the assumption that "full" DAG versioning is 
going to address: you will NOT need DAG backwards compatibility between runs.
   
   IMHO the proposed change will work nicely as long as just the basic 
assumptions about DagRuns are kept. The operators do not even have to be 
idempotent in those cases, to be honest, but it helps with the mental model of 
DagRuns and schedule if they are. This is by far the most common case for 
which Airflow is used today, and I still consider the other cases "niche" - my 
assessment here is unchanged. With Airflow 3.0 we are indeed going a bit 
further. For example, we are removing execution_date, which was there 
primarily for that case, and that will make DAG runs even more "independent" 
from the schedule. It will even increase the need for separation between DAG 
runs - which means that almost by definition you should be able to run one DAG 
run with one version of the code and another DAG run with another version - 
without the need to pause the DAGs. That's my assessment after participating 
in those discussions and even voting (as a PMC member) on the versioning AIP. 
I am not sure how much you were involved and how much you read and understood 
about how versioning works, but if you think DAG versioning will not support 
your case (I think it should), you should start a discussion on the devlist 
and in the AIP suggesting the changes needed to support it - because that was 
the primary reason why we proposed, voted on, and are implementing DAG 
versioning. 
   
   > A typical example of what Airflow can be used for is updating a data model 
in a database (e.g. by running the dbt tool). Such tasks are not idempotent, 
nor independent, nor backwards compatible when updating the code.
   
   Sure, but in this case, if you have DAG runs that do this, you have to set 
`max_active_runs=1`, otherwise they will start running in parallel. And in 
this case, any such update to the DAG, including versioning, should be 
transparent for you - because you will never have one DAG run with old code 
and another DAG run with new code at the same time. The currently active DAG 
run will simply continue running with the old version (including all tasks 
from that DAG run that are scheduled to run for this particular DAG run), and 
all tasks for the "future" DAG run (which will not be running yet because of 
`max_active_runs=1`) will use the new code. So you will effectively achieve 
the same as pausing - without pausing.
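   
   To make that sequence concrete, here is a toy model in plain Python. It is 
only a sketch of the behaviour described above, not Airflow's scheduler or its 
versioning implementation; `ToyScheduler`, `Run`, `current_version` and 
`try_start_run` are hypothetical names, and the assumption that a run pins the 
code version at the moment it is created is mine:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    run_id: int
    version: int            # code version pinned when the run is created (assumption)
    finished: bool = False


@dataclass
class ToyScheduler:
    max_active_runs: int = 1
    current_version: int = 1
    runs: List[Run] = field(default_factory=list)

    def active(self) -> List[Run]:
        """Runs that have been started but not yet finished."""
        return [r for r in self.runs if not r.finished]

    def try_start_run(self) -> Optional[Run]:
        """Start a new run only if the active-run limit allows it."""
        if len(self.active()) >= self.max_active_runs:
            return None
        run = Run(run_id=len(self.runs) + 1, version=self.current_version)
        self.runs.append(run)
        return run


sched = ToyScheduler(max_active_runs=1)
first = sched.try_start_run()     # pinned to version 1
sched.current_version = 2         # DAG code is updated while the run is active
blocked = sched.try_start_run()   # None: the limit of one active run is reached
first.finished = True             # the old run completes on the old version
second = sched.try_start_run()    # the next run is pinned to version 2

print(first.version, blocked, second.version)  # 1 None 2
```

   In this model the old and new versions never run side by side, which is 
the same outcome that pausing the DAG before an update is meant to achieve.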
   
   I think you are still rooted too much in the way Airflow 2 works, where 
you have absolutely no control over which version of the code will be used by 
which DAG run's tasks - in Airflow 3, when full versioning is implemented, 
this is precisely what is going to change. 
   
   But regardless - it does not really matter. What matters now is that this 
"feature" is up for grabs for anyone. If you think you want this feature, you 
can absolutely propose to implement it (in Airflow 3, because there will not 
be a new feature release for Airflow 2). If there is no one else among the 
maintainers who thinks it's worth implementing as part of the current Airflow 
3 efforts (most maintainers are heads down doing it) - then it needs 
**someone** to roll their sleeves up and propose and contribute an 
implementation of it (again - only in Airflow 3 - because there will not be an 
Airflow 2 feature release). So my opinion on whether it's "niche" or not does 
not matter - as I am just one of the PMC members. What matters is whether 
there is enough will and enough hands to work on this feature and implement 
it - also taking into account that likely (if my assessment is correct) DAG 
versioning will significantly decrease the need for this weird 
pausing/unpausing scheme in the first place. 
   
   So - just to repeat - if you want to focus on this, implement it and make 
a PR, you are absolutely free to do it, but IMHO this feature is largely 
obsoleted by what's coming in Airflow 3, so you are unlikely to find an 
advocate among those involved in Airflow 3 who will spend their time on it. 
But you might if you want. And this is all I am saying.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

