What do you think, Pablo, about "being out" vs. "being in" the official repo?
On Thu, Jul 28, 2022 at 3:51 PM Jarek Potiuk <[email protected]> wrote:

> Anyone :) ?
>
> On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <[email protected]> wrote:
>
>> I would love to hear what others think about the "in/out" approach - mine
>> is just the line of thought I've been exploring during the last few months,
>> where I formed my own views about providers, maintenance, the incentives of
>> entities maintaining open-source projects, and especially the expectations
>> this creates for users. But those are just my thoughts, and I'd love to
>> hear what others think.
>>
>> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:
>>
>>> I had some thoughts about it - this is also connected with the recent
>>> discussions about mixed governance for providers, and I think it's worth
>>> using this discussion to set some rules and "boundaries" on when, how,
>>> and especially why we want to accept some contributions, while for other
>>> contributions it's better to stay outside.
>>>
>>> We are about to start thinking (and discussing) more seriously about how
>>> to split the Airflow providers off Airflow. And I think we can split off
>>> more than providers - this might be a good candidate to be a standalone,
>>> but still community-maintained, package. If we are going to solve the
>>> problem of splitting Airflow into N packages, one more package does not
>>> matter. And it would nicely solve "version independence". We could even
>>> make it Airflow 2.0+ compatible if we want.
>>>
>>> So I think the question of "is it tied to a specific Airflow version or
>>> not" does not really prevent us from making it part of the community -
>>> those two are not related (if we are going to have more repositories
>>> anyway).
>>>
>>> The important part is really how "self-servicing" we can make it, how we
>>> make sure it stays relevant with future versions of Airflow, and who does
>>> that - namely, who has the incentive and "responsibility" to maintain it.
>>> I am sure we will add more features to Airflow DAGs and simplify the way
>>> DAGs are written over time, and the test harness will have to adapt to it.
>>>
>>> There are pros and cons to having such a standalone package "in the
>>> community/ASF project" versus "out of it". We have a good example (of a
>>> similar kind of tool/util) from the past that we can learn from (and
>>> maybe Bas can share more insights):
>>>
>>> https://github.com/BasPH/pylint-airflow - a pylint plugin for Airflow DAGs
>>>
>>> Initially that was "sponsored" by GoDataDriven, where Bas worked, and I
>>> think that is where it was born. That made sense, as it was likely also
>>> useful for the customers of GoDataDriven (here I am guessing). But
>>> apparently GoDataDriven's incentives wound down, and it turned out that
>>> its usefulness was not as big as expected (also, I think we all in the
>>> Python community learned that pylint is more of a distraction than real
>>> help - we dumped pylint eventually), and the plugin was not maintained
>>> beyond some versions of 1.10. The tool is all but defunct now. Which is
>>> perfectly understandable.
>>>
>>> In this case there is (I think) no risk of a "pylint"-like problem, but
>>> the question of maintenance and adaptation to future versions of Airflow
>>> remains.
>>>
>>> I think there is one big difference between something that is "in ASF
>>> repos" and "out":
>>>
>>> * If we make it a standalone package in the "ASF Airflow community", we
>>> will have some obligation, and expectations from our users, to maintain
>>> it. We can add a test harness (regardless of whether it lives in the
>>> Airflow repository or a separate one) to make sure that new Airflow
>>> "core" changes will not break it - and we can fail our PRs if they do -
>>> basically making "core" maintainers take care of this problem rather
>>> than delegating it to someone else to react to core changes (this is
>>> what has to happen with providers, I believe, even if we split them into
>>> a separate repo). I think anything that we as the ASF community release
>>> should have such harnesses - making sure that whatever we release and
>>> make available to our users works together.
>>>
>>> * If it is outside of the "ASF community", someone else will have to
>>> react to "core Airflow" changes. We will not do it in the community, we
>>> will not pay attention, and such an "external tool" might break at any
>>> time because we introduced a change in a part of the core that the
>>> external tool implicitly relied on.
>>>
>>> For me, the question of whether something should be in or out should be
>>> based on:
>>>
>>> * Is it really useful for the community as a whole? -> If yes, we
>>> should consider it.
>>> * Is it strongly tied to the core of Airflow, in the sense of relying
>>> on internals that might change easily? -> If not, there is no need to
>>> bring it in; it can easily be maintained outside by anyone.
>>> * If it is strongly tied to the core -> is there someone (a person or
>>> organisation) who wants to take on the burden of maintaining it and has
>>> an incentive to do it for quite some time? -> If yes, great, let them do
>>> that!
>>> * If it is strongly tied, do we want to take on the burden as "core
>>> Airflow maintainers" of keeping it updated together with the core? ->
>>> If yes, we should bring it in.
>>>
>>> If we have a strongly tied tool that we do not want to maintain in the
>>> core, and there is no entity who would like to do it, then I think this
>>> idea should be dropped :).
>>>
>>> J.
>>>
>>>
>>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> Wow, I really love this idea. This will greatly enrich the Airflow
>>>> ecosystem.
>>>>
>>>> I agree with Ash: it is better to have it as a standalone package. And
>>>> we can use this framework to write Airflow core invariant tests, so
>>>> that we can run them on every Airflow release to guarantee no
>>>> regressions.
>>>>
>>>> Thanks,
>>>>
>>>> Ping
>>>>
>>>>
>>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]> wrote:
>>>>
>>>>> Understood!
>>>>>
>>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>>> execution invariants' or 'DAG execution expectations' given certain
>>>>> task outcomes.
>>>>>
>>>>> As DAGs grow in complexity, it can become difficult to reason about
>>>>> their runtime behavior in many scenarios. Users may want to lay out
>>>>> rules, in the form of tests, that verify DAG execution results. For
>>>>> example:
>>>>>
>>>>> - If any of my database_backup_* tasks fails, I want to ensure that
>>>>> at least one email_alert_* task will run.
>>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>>> whole DAG will fail.
>>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>>> PubsubOperator downstream will always run.
>>>>>
>>>>> These sorts of invariants don't require the DAG to be executed, yet
>>>>> they are pretty hard to test today: staging environments can't check
>>>>> every possible runtime outcome.
>>>>>
>>>>> In this framework, users would define unit tests like this:
>>>>>
>>>>> ```
>>>>> def test_my_example_dag():
>>>>>     the_dag = models.DAG(
>>>>>         'the_basic_dag',
>>>>>         schedule_interval='@daily',
>>>>>         start_date=DEFAULT_DATE,
>>>>>     )
>>>>>
>>>>>     with the_dag:
>>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>>
>>>>>         op1 >> op2 >> op3
>>>>>
>>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>>>     # always run.
>>>>>     assert_that(
>>>>>         given(the_dag)
>>>>>         .when(task('task_1'), succeeds())
>>>>>         .and_(task('task_2'), succeeds())
>>>>>         .then(task('task_3'), runs()))
>>>>> ```
>>>>>
>>>>> This is a very simple example - and it's not great, because it only
>>>>> duplicates the DAG logic - but you can see more examples in my draft
>>>>> PR [1] and in my draft AIP [2].
>>>>>
>>>>> I started writing up an AIP in a Google doc [2] which y'all can
>>>>> check. It's very close to what I have written here : )
>>>>>
>>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>>> library if y'all wanna be cautious about adding it directly to
>>>>> Airflow.
>>>>> -P.
>>>>>
>>>>> [1] https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>>> [2] https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>>
>>>>>
>>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>>> Yep. Just outline your proposal on the devlist, Pablo :).
>>>>>>
>>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>> >
>>>>>> > Hi Pablo,
>>>>>> >
>>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>>> entirely possible it doesn't need any changes to core Airflow, or
>>>>>> isn't significant enough to need an AIP.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Ash
>>>>>> >
>>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <[email protected]> wrote:
>>>>>> >>
>>>>>> >> Hi there!
>>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>>> testing framework for Airflow.
>>>>>> >> I believe the first step would be to write up an AIP - so could I
>>>>>> have access to write a new one on the cwiki?
>>>>>> >>
>>>>>> >> Thanks!
>>>>>> >> -P.
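For illustration, the first invariant in Pablo's list above ("if any of my database_backup_* tasks fails, at least one email_alert_* task will run") might read like this in the proposed fluent API. This is a hedged sketch only: assert_that, given, when, then, task, succeeds, and runs appear in the draft PR, while the import path, fails(), the wildcard tasks_matching() matcher, and any_runs() are hypothetical extrapolations, not part of the draft:

```
import datetime

from airflow import models
from airflow.operators.empty import EmptyOperator

# Hypothetical module path - the draft PR has not settled on a package
# name. fails, tasks_matching, and any_runs are likewise hypothetical,
# extrapolated from the draft's succeeds()/runs() matchers.
from airflow_dag_invariants import (
    any_runs, assert_that, fails, given, tasks_matching,
)

DEFAULT_DATE = datetime.datetime(2022, 1, 1)


def test_backup_failure_triggers_alert():
    the_dag = models.DAG(
        'backup_dag',
        schedule_interval='@daily',
        start_date=DEFAULT_DATE,
    )

    with the_dag:
        backup = EmptyOperator(task_id='database_backup_1')
        # trigger_rule='one_failed' makes the alert run when the backup fails.
        alert = EmptyOperator(task_id='email_alert_1',
                              trigger_rule='one_failed')
        backup >> alert

    # Invariant: if any database_backup_* task fails, at least one
    # email_alert_* task runs - checked without executing the DAG.
    assert_that(
        given(the_dag)
        .when(tasks_matching('database_backup_*'), fails())
        .then(tasks_matching('email_alert_*'), any_runs()))
```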

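On Jarek's point about a test harness that lets core PRs fail when they break an external, community-maintained package: below is a minimal sketch of the kind of guard that could run in Airflow core CI, assuming the same hypothetical module name (airflow_dag_invariants) for the proposed framework. It only asserts that the package still imports and exposes its public surface against the current core; the package's own test suite would cover behavior:

```
import pytest

# Hypothetical module name; substitute the real one once the package exists.
PACKAGE = "airflow_dag_invariants"

# Public surface taken from the draft PR's example test.
PUBLIC_SYMBOLS = ("assert_that", "given", "task", "succeeds", "runs")


def test_external_framework_still_integrates_with_core():
    # Skip (rather than fail) in CI jobs where the package is not installed.
    pkg = pytest.importorskip(PACKAGE)
    for symbol in PUBLIC_SYMBOLS:
        assert hasattr(pkg, symbol), (
            f"{PACKAGE} no longer exposes {symbol}; a core change may have "
            "broken the external testing framework"
        )
```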