I would love to hear what others think about the "in/out" approach. Mine is just the line of thought I've been exploring over the last few months, where I formed my own views on providers, maintenance, the incentives of entities maintaining open-source projects, and - especially - the expectations this creates among users. But those are just my thoughts, and I'd love to hear other opinions.
On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:

> I had some thoughts about it - this is also connected with recent
> discussions about mixed governance for providers, and I think it's worth
> using this discussion to set some rules and "boundaries" on when, how, and
> especially why we want to accept some contributions, while for some other
> contributions it's better to stay outside.
>
> We are about to start thinking (and discussing) more seriously about how
> to split the Airflow providers off Airflow. And I think we can split off
> more than providers - this might be a good candidate for a standalone, but
> still community-maintained, package. If we are going to solve the problem
> of splitting Airflow into N packages, one more package does not matter.
> And it would nicely solve "version independence". We could even make it
> Airflow 2.0+ compatible if we want.
>
> So I think the question of "is it tied to a specific Airflow version or
> not" does not really prevent us from making it part of the community -
> those two are not related (if we are going to have more repositories
> anyway).
>
> The important part is really how "self-servicing" we can make it, how we
> make sure it stays relevant with future versions of Airflow, and who does
> that - namely, who has the incentive and "responsibility" to maintain it.
> I am sure we will add more features to Airflow DAGs and simplify the way
> DAGs are written over time, and the test harness will have to adapt to
> that.
>
> There are pros and cons of having such a standalone package "in the
> community/ASF project" and "out of it". We have a good example (of a
> similar kind of tool/util) from the past that we can learn from (and
> maybe Bas can share more insights):
>
> https://github.com/BasPH/pylint-airflow - a pylint plugin for Airflow DAGs
>
> Initially that was "sponsored" by GoDataDriven, where Bas worked, and I
> think this is where it was born.
> And that made sense, as it was likely also useful for the customers of
> GoDataDriven (here I am guessing). But apparently GoDataDriven's
> incentives wound down, and it turned out that its usefulness was not as
> big as hoped (I also think we all in the Python community learned that
> Pylint is more of a distraction than real help - we dumped Pylint
> eventually). The plugin was not maintained beyond some versions of 1.10,
> and the tool is all but defunct now. Which is perfectly understandable.
>
> In this case there is (I think) no risk of a "pylint"-like problem, but
> the question of maintenance and adaptation to future versions of Airflow
> remains.
>
> I think there is one big difference between something that is "in ASF
> repos" and "out":
>
> * If we make it a standalone package in the "ASF Airflow community", we
> will have some obligation, and expectations from our users, to maintain
> it. We can add a test harness (regardless of whether it lives in the
> Airflow repository or in a separate one) to make sure that new Airflow
> "core" changes will not break it, and we can fail our PRs if they do -
> basically making "core" maintainers take care of this problem rather than
> delegating it to someone else to react to core changes (this is what has
> to happen with providers, I believe, even if we split them into a
> separate repo). I think anything that we as the ASF community release
> should have such harnesses - making sure that whatever we release and
> make available to our users works together.
>
> * If it is outside the "ASF community", someone else will have to react
> to "core Airflow" changes. We will not do it in the community, we will
> not pay attention, and such an "external tool" might break at any time
> because we introduced a change in a part of the core that the external
> tool implicitly relied on.
>
> For me, the question of whether something should be in or out should be
> based on:
>
> * Is it really useful for the community as a whole?
> -> If yes, we should consider it.
> * Is it strongly tied to the core of Airflow, in the sense of relying on
> some internals that might change easily? -> If not, there is no need to
> bring it in; it can easily be maintained outside by anyone.
> * If it is strongly tied to the core - is there someone (a person or
> organisation) who wants to take on the burden of maintaining it and has
> the incentive to do so for quite some time? -> If yes, great, let them do
> that!
> * If it is strongly tied, do we want to take on the burden, as "core
> Airflow maintainers", of keeping it updated together with the core? ->
> If yes, we should bring it in.
>
> If we have a strongly tied tool that we do not want to maintain in the
> core, and there is no entity who would like to do it, then I think this
> idea should be dropped :).
>
> J.
>
>
> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>
>> Hi Pablo,
>>
>> Wow, I really love this idea. This will greatly enrich the Airflow
>> ecosystem.
>>
>> I agree with Ash, it is better to have it as a standalone package. And
>> we can use this framework to write Airflow core invariant tests, so
>> that we can run them on every Airflow release to guarantee no
>> regressions.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]>
>> wrote:
>>
>>> Understood!
>>>
>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>> execution invariants' or 'DAG execution expectations' given certain
>>> task outcomes.
>>>
>>> As DAGs grow in complexity, it can become difficult to reason about
>>> their runtime behavior in many scenarios. Users may want to lay out
>>> rules in the form of tests that can verify DAG execution results. For
>>> example:
>>>
>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>> least one email_alert_* task will run.
>>> - If my 'check_authentication' task fails, I want to ensure that the
>>> whole DAG will fail.
>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>> PubsubOperator downstream will always run.
>>>
>>> These sorts of invariants don't need the DAG to be executed; but in
>>> fact, they are pretty hard to test today: staging environments can't
>>> check every possible runtime outcome.
>>>
>>> In this framework, users would define unit tests like this:
>>>
>>> ```
>>> def test_my_example_dag():
>>>     the_dag = models.DAG(
>>>         'the_basic_dag',
>>>         schedule_interval='@daily',
>>>         start_date=DEFAULT_DATE,
>>>     )
>>>
>>>     with the_dag:
>>>         op1 = EmptyOperator(task_id='task_1')
>>>         op2 = EmptyOperator(task_id='task_2')
>>>         op3 = EmptyOperator(task_id='task_3')
>>>
>>>         op1 >> op2 >> op3
>>>
>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>     # always run
>>>     assert_that(
>>>         given(the_dag)
>>>             .when(task('task_1'), succeeds())
>>>             .and_(task('task_2'), succeeds())
>>>             .then(task('task_3'), runs()))
>>> ```
>>>
>>> This is a very simple example - and it's not a great one, because it
>>> only duplicates the DAG logic - but you can see more examples in my
>>> draft PR [1] and in my draft AIP [2].
>>>
>>> I started writing up an AIP in a Google doc [2] which y'all can check.
>>> It's very close to what I have written here : )
>>>
>>> LMK what y'all think. I am also happy to publish this as a separate
>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>> -P.
>>>
>>> [1]
>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>> [2]
>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>
>>>
>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>
>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>>> wrote:
>>>> >
>>>> > Hi Pablo,
>>>> >
>>>> > Could you describe at a high level what you are thinking of? It's
>>>> > entirely possible it doesn't need any changes to core Airflow, or
>>>> > isn't significant enough to need an AIP.
>>>> >
>>>> > Thanks,
>>>> > Ash
>>>> >
>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>> > <[email protected]> wrote:
>>>> >>
>>>> >> Hi there!
>>>> >> I would like to start a discussion of an idea that I had for a
>>>> >> testing framework for airflow.
>>>> >> I believe the first step would be to write up an AIP - so could I
>>>> >> have access to write a new one on the cwiki?
>>>> >>
>>>> >> Thanks!
>>>> >> -P.
>>>>
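To make the in/out and maintenance discussion above a bit more concrete, here is a rough, self-contained sketch of how the fluent invariant-checking idea could work in principle. To be clear: this is NOT the draft implementation from Pablo's PR - the names (`given`, `when`, `and_`, `then_runs`) only mirror the shape of his example, and everything is re-implemented over a toy DAG model with just the default "all_success" trigger rule, so it runs without Airflow at all.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy stand-in for an Airflow operator: just an id and upstream links."""
    task_id: str
    upstream: list = field(default_factory=list)

    def __rshift__(self, other):
        # Mimic Airflow's `op1 >> op2` dependency syntax.
        other.upstream.append(self)
        return other


class InvariantCheck:
    """Answers "would task X run?" given assumed outcomes for some tasks."""

    def __init__(self, tasks):
        self.tasks = {t.task_id: t for t in tasks}
        self.assumed = {}  # task_id -> "success" or "failed"

    def when(self, task_id, state):
        self.assumed[task_id] = state
        return self

    and_ = when  # fluent alias, echoing the proposed API shape

    def then_runs(self, task_id):
        return self._runs(self.tasks[task_id])

    def _runs(self, task):
        # Default "all_success" trigger rule: a task runs only if every
        # upstream task succeeded. Tasks with no assumed state are treated
        # as succeeding whenever they themselves would run - a deliberate
        # simplification for this sketch.
        for up in task.upstream:
            state = self.assumed.get(up.task_id)
            if state == "failed":
                return False
            if state is None and not self._runs(up):
                return False
        return True


def given(*tasks):
    return InvariantCheck(tasks)
```

A test over the `task_1 >> task_2 >> task_3` chain from the example would then read `assert given(t1, t2, t3).when("task_1", "success").and_("task_2", "success").then_runs("task_3")`. A real implementation would of course need to model Airflow's other trigger rules (`one_failed`, `all_done`, etc.) and enumerate outcome combinations rather than propagate a single assumption set - which is exactly the kind of coupling to core semantics that the maintenance question above is about.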
