I had some thoughts about it - this is also connected with recent discussions about mixed governance for providers, and I think it's worth using this discussion to set some rules and "boundaries" on when, how, and especially why we want to accept some contributions into the community, while other contributions are better kept outside.
We are about to start thinking (and discussing) more seriously about how to split the Airflow providers off from Airflow. And I think we can split off more than providers - this might be a good candidate for a standalone, but still community-maintained, package. If we are going to solve the problem of splitting Airflow into N packages, one more package does not matter. And it would nicely solve "version independence". We could even make it Airflow 2.0+ compatible if we want.

So I think the question of "is it tied to a specific Airflow version or not" does not really prevent us from making it part of the community - those two are not related (if we are going to have more repositories anyway). The important part is really how "self-servicing" we can make it, how we make sure it stays relevant with future versions of Airflow, and who does that - namely, who has the incentive and "responsibility" to maintain it. I am sure we will add more features to Airflow DAGs and simplify the way DAGs are written over time, and the test harness will have to adapt to that.

There are pros and cons of having such a standalone package "in the community/ASF project" versus "out of it". We have a good example (from a similar kind of tool/util) in the past that we can learn from (and maybe Bas can share more insights):

https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs

Initially that was "sponsored" by GoDataDriven, where Bas worked, and I think this is where it was born. That made sense, as it was likely also useful for the customers of GoDataDriven (here I am guessing). But apparently GoDataDriven's incentives wound down, and it turned out the tool was not as useful as hoped (also, I think we all in the Python community learned that Pylint is more of a distraction than a real help - we dumped Pylint eventually), and the plugin was not maintained beyond some versions of 1.10. The tool is all but defunct now. Which is perfectly understandable.

In this case there is (I think) no risk of a "pylint"-like problem, but the question of maintenance and adaptation to future versions of Airflow remains. I think there is one big difference between something that is "in ASF repos" and "out":

* If we make it a standalone package in the "ASF Airflow community", we will have some obligation and expectation from our users to maintain it. We can add a test harness (regardless of whether it lives in the Airflow repository or in a separate one) to make sure that new Airflow "core" changes will not break it, and we can fail our PRs if they do - basically making "core" maintainers take care of this problem rather than delegating it to someone else to react to core changes (this is what has to happen with providers, I believe, even if we split them into a separate repo). I think anything that we as the ASF community release should have such harnesses - making sure that whatever we release and make available to our users works together. See the sketch after this list for what such a harness test could look like.

* If it is outside of the "ASF community", someone else will have to react to "core Airflow" changes. We will not do it in the community, we will not pay attention, and such an "external tool" might break at any time because we introduced a change in a part of the core that the external tool implicitly relied on.
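To make the "harness" idea from the first bullet concrete, here is a minimal sketch of what such a test could look like in core CI. To be clear: the package name "dag_test_harness" and the given/when/then helpers are placeholders borrowed from Pablo's proposal below, not an existing API.

```
# A hypothetical harness test living in core Airflow's CI. If a core change
# breaks the standalone package's assumptions, it fails the core PR instead
# of surfacing later in the external tool.
import datetime

import pytest

from airflow.models import DAG
from airflow.operators.empty import EmptyOperator

# Skip (rather than fail) when the harness package is not installed, so the
# core suite still passes in environments that do not pull it in.
harness = pytest.importorskip("dag_test_harness")  # placeholder name


def test_harness_still_works_with_core():
    with DAG(
        dag_id="harness_smoke",
        schedule_interval="@daily",
        start_date=datetime.datetime(2022, 1, 1),
    ) as dag:
        first = EmptyOperator(task_id="first")
        last = EmptyOperator(task_id="last")
        first >> last

    # The smallest possible invariant: if "first" succeeds, "last" runs.
    harness.assert_that(
        harness.given(dag)
        .when(harness.task("first"), harness.succeeds())
        .then(harness.task("last"), harness.runs())
    )
```

Such a test would make the compatibility contract explicit: the harness package becomes a test dependency of core, and breaking it becomes visible at PR time rather than at the next external release.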
For me, the question of whether something should be in or out should be based on:

* Is it really useful for the community as a whole? -> If yes, we should consider it.

* Is it strongly tied to the core of Airflow, in the sense of relying on some internals that might change easily? -> If not, there is no need to bring it in; it can easily be maintained outside by anyone.

* If it is strongly tied to the core -> is there someone (a person, an organisation) who wants to take on the burden of maintaining it and has an incentive to do so for quite some time? -> If yes, great, let them do that!

* If it is strongly tied, do we want to take on the burden, as "core Airflow maintainers", of keeping it updated together with the core? -> If yes, we should bring it in.

If we have a strongly tied tool that we do not want to maintain in the core, and there is no entity who would like to do it, then I think this idea should be dropped :).

J.

On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:

> Hi Pablo,
>
> Wow, I really love this idea. This will greatly enrich the Airflow
> ecosystem.
>
> I agree with Ash, it is better to have it as a standalone package. And we
> can use this framework to write Airflow core invariant tests, so that we
> can run them on every Airflow release to guarantee no regressions.
>
> Thanks,
>
> Ping
>
>
> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]>
> wrote:
>
>> Understood!
>>
>> TL;DR: I propose a testing framework where users can check for 'DAG
>> execution invariants' or 'DAG execution expectations' given certain task
>> outcomes.
>>
>> As DAGs grow in complexity, it can become difficult to reason about
>> their runtime behavior in many scenarios. Users may want to lay out
>> rules in the form of tests that can verify DAG execution results. For
>> example:
>>
>> - If any of my database_backup_* tasks fails, I want to ensure that at
>> least one email_alert_* task will run.
>> - If my 'check_authentication' task fails, I want to ensure that the
>> whole DAG will fail.
>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>> PubsubOperator downstream will always run.
>>
>> These sorts of invariants don't need the DAG to be executed; but in fact,
>> they are pretty hard to test today: staging environments can't check every
>> possible runtime outcome.
>>
>> In this framework, users would define unit tests like this:
>>
>> ```
>> from airflow import models
>> from airflow.operators.empty import EmptyOperator
>>
>> # assert_that, given, task, succeeds and runs would come from the
>> # proposed framework; DEFAULT_DATE is assumed to be defined elsewhere.
>>
>> def test_my_example_dag():
>>     the_dag = models.DAG(
>>         'the_basic_dag',
>>         schedule_interval='@daily',
>>         start_date=DEFAULT_DATE,
>>     )
>>
>>     with the_dag:
>>         op1 = EmptyOperator(task_id='task_1')
>>         op2 = EmptyOperator(task_id='task_2')
>>         op3 = EmptyOperator(task_id='task_3')
>>
>>         op1 >> op2 >> op3
>>
>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>     # always run
>>     assert_that(
>>         given(the_dag)\
>>         .when(task('task_1'), succeeds())\
>>         .and_(task('task_2'), succeeds())\
>>         .then(task('task_3'), runs()))
>> ```
>>
>> This is a very simple example - and it's not great, because it only
>> duplicates the DAG logic - but you can see more examples in my draft PR
>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>> and in my draft AIP
>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>[2].
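>>
>> For a case closer to the first bullet above (a backup failure must still
>> trigger an alert), a sketch with the same proposed API could look like
>> the snippet below. fails() is illustrative here, mirroring succeeds()
>> and runs(); the trigger_rule is what makes the invariant non-trivial:
>>
>> ```
>> def test_backup_failure_still_alerts():
>>     with models.DAG(
>>         'backup_dag',
>>         schedule_interval='@daily',
>>         start_date=DEFAULT_DATE,
>>     ) as the_dag:
>>         backup = EmptyOperator(task_id='database_backup_1')
>>         # 'all_done' lets the alert run even when the backup fails.
>>         alert = EmptyOperator(
>>             task_id='email_alert_1',
>>             trigger_rule='all_done',
>>         )
>>         backup >> alert
>>
>>     # DAG invariant: if the backup fails, the alert still runs.
>>     assert_that(
>>         given(the_dag)\
>>         .when(task('database_backup_1'), fails())\
>>         .then(task('email_alert_1'), runs()))
>> ```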
>> I started writing up an AIP in a Google doc[2] which y'all can check.
>> It's very close to what I have written here : )
>>
>> LMK what y'all think. I am also happy to publish this as a separate
>> library if y'all wanna be cautious about adding it directly to Airflow.
>> -P.
>>
>> [1]
>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>> [2]
>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>
>>
>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>
>>> Yep. Just outline your proposal on the devlist, Pablo :).
>>>
>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Could you describe at a high level what you are thinking of? It's
>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>> significant enough to need an AIP.
>>> >
>>> > Thanks,
>>> > Ash
>>> >
>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi there!
>>> >> I would like to start a discussion of an idea that I had for a
>>> testing framework for Airflow.
>>> >> I believe the first step would be to write up an AIP - so could I
>>> have access to write a new one on the cwiki?
>>> >>
>>> >> Thanks!
>>> >> -P.
