I would love to hear what others think about the "in/out" approach - this
is just the line of thought I have been exploring over the last few months:
providers, maintenance, the incentives of entities maintaining open-source
projects, and especially the expectations it creates among users. But those
are just my thoughts.

On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:

> I had some thoughts about it - this is also connected with recent
> discussions about mixed governance for providers, and I think it's worth
> using this discussion to set some rules and "boundaries" on when, how, and
> especially why we want to accept some contributions, while others are
> better kept outside.
>
> We are about to start thinking (and discussing) more seriously about how
> to split the Airflow providers off airflow. And I think we can split off
> more than providers - this might be a good candidate for a standalone, but
> still community-maintained, package. If we are going to solve the problem
> of splitting airflow into N packages, one more package does not matter.
> And it would nicely solve "version independence". We could even make it
> airflow 2.0+ compatible if we want.
>
> So I think the question of "is it tied to a specific airflow version or
> not" does not really prevent us from making it part of the community -
> those two are not related (if we are going to have more repositories
> anyway).
>
> The important part is really how "self-servicing" we can make it, how we
> make sure it stays relevant with future versions of Airflow, and who does
> that - namely, who has the incentive and "responsibility" to maintain it.
> I am sure we will add more features to Airflow DAGs and simplify the way
> DAGs are written over time, and the test harness will have to adapt to
> that.
>
> There are pros and cons of having such a standalone package "in the
> community/ASF project" and "out of it". We have a good example of a
> similar kind of tool/util in the past that we can learn from (and maybe
> Bas can share more insights).
>
> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>
> Initially it was "sponsored" by GoDataDriven, where Bas worked, and I
> think that is where it was born. That made sense, as it was likely also
> useful for GoDataDriven's customers (here I am guessing). But apparently
> GoDataDriven's incentives wound down, and it turned out the tool was not
> as useful as hoped (also, I think we all in the Python community learned
> that Pylint is more of a distraction than a real help - we dumped Pylint
> eventually). The plugin was not maintained beyond some versions of 1.10,
> and the tool is all but defunct now. Which is perfectly understandable.
>
> In this case there is (I think) no risk of a "pylint"-like problem, but
> the question of maintenance and adaptation to future versions of Airflow
> remains.
>
> I think there is one big difference between something that is "in ASF
> repos" and something that is "out":
>
> * if we make it a standalone package in the "asf airflow community" - we
> will have some obligation, and expectations from our users, to maintain
> it. We can add a test harness (regardless of whether it lives in the
> airflow repository or in a separate one) to make sure that new airflow
> "core" changes will not break it, and we can fail our PRs if they do -
> basically making "core" maintainers take care of this problem rather than
> delegating it to someone else who has to react to core changes (this is
> what has to happen with providers, I believe, even if we split them to a
> separate repo). I sketch after this list what such a harness test could
> look like. I think anything that we as the ASF community release should
> have such harnesses - making sure that whatever we release and make
> available to our users works together.
>
> * if it is outside of the "ASF community", someone else will have to react
> to "core airflow" changes. We will not do it in the community, we will not
> pay attention, and such an "external tool" might break at any time because
> we introduced a change in a part of the core that the external tool
> implicitly relied on.
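>
> Just to make the harness idea concrete, here is a minimal sketch of such a
> test - the package name "airflow_dag_test" is made up, and the
> given/when/then API is the one from Pablo's draft, so treat all of it as
> an assumption rather than a real implementation:
>
> ```
> import datetime
>
> import pytest
>
> from airflow import models
> from airflow.operators.empty import EmptyOperator
>
> # Skip (rather than fail) if the standalone package is not installed in
> # this CI job; the package name is hypothetical.
> dag_test = pytest.importorskip('airflow_dag_test')
>
>
> def test_standalone_harness_works_with_core():
>     # Build a trivial DAG using only public core APIs.
>     with models.DAG(
>         'harness_smoke_test',
>         schedule_interval='@daily',
>         start_date=datetime.datetime(2022, 1, 1),
>     ) as dag:
>         first = EmptyOperator(task_id='first')
>         second = EmptyOperator(task_id='second')
>         first >> second
>
>     # Exercise the framework's public API against this core build - a
>     # breaking core change would fail this test in the core PR itself.
>     dag_test.assert_that(
>         dag_test.given(dag)
>             .when(dag_test.task('first'), dag_test.succeeds())
>             .then(dag_test.task('second'), dag_test.runs()))
> ```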
>
> For me, the question of whether something should be in or out should be
> based on:
>
> * is it really useful for the community as a whole? -> if yes, we should
> consider it
> * is it strongly tied to the core of airflow, in the sense of relying on
> some internals that might change easily? -> if not, there is no need to
> bring it in; it can easily be maintained outside by anyone
> * if it is strongly tied to the core -> is there someone (a person, an
> organisation) who wants to take on the burden of maintaining it and has an
> incentive to do it for quite some time? -> if yes, great, let them do
> that!
> * if it is strongly tied, do we want to take on the burden, as "core
> airflow maintainers", of keeping it updated together with the core? -> if
> yes, we should bring it in
>
> If we have a strongly tied tool that we do not want to maintain in the
> core and there is no entity who would like to do it, then I think this idea
> should be dropped :).
>
> J.
>
>
> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>
>> Hi Pablo,
>>
>> Wow, I really love this idea. This will greatly enrich the Airflow
>> ecosystem.
>>
>> I agree with Ash that it is better to have it as a standalone package. We
>> could also use this framework to write Airflow core invariant tests and
>> run them on every Airflow release to guard against regressions.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]>
>> wrote:
>>
>>> Understood!
>>>
>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>> execution invariants' or 'DAG execution expectations' given certain task
>>> outcomes.
>>>
>>> As DAGs grow in complexity, it can become difficult to reason about
>>> their runtime behavior across many scenarios. Users may want to lay out
>>> rules, in the form of tests, that verify DAG execution results.
>>> For example:
>>>
>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>> least one email_alert_* task will run.
>>> - If my 'check_authentication' task fails, I want to ensure that the
>>> whole DAG will fail.
>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>> PubsubOperator downstream will always run.
>>>
>>> These sorts of invariants don't require the DAG to actually be executed,
>>> yet they are pretty hard to test today: staging environments can't check
>>> every possible runtime outcome.
>>>
>>> In this framework, users would define unit tests like this:
>>>
>>> ```
>>> import datetime
>>>
>>> from airflow import models
>>> from airflow.operators.empty import EmptyOperator
>>>
>>> DEFAULT_DATE = datetime.datetime(2022, 1, 1)
>>>
>>>
>>> def test_my_example_dag():
>>>     the_dag = models.DAG(
>>>         'the_basic_dag',
>>>         schedule_interval='@daily',
>>>         start_date=DEFAULT_DATE,
>>>     )
>>>
>>>     with the_dag:
>>>         op1 = EmptyOperator(task_id='task_1')
>>>         op2 = EmptyOperator(task_id='task_2')
>>>         op3 = EmptyOperator(task_id='task_3')
>>>
>>>         op1 >> op2 >> op3
>>>
>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 always
>>>     # runs. assert_that, given, task, succeeds and runs come from the
>>>     # proposed framework.
>>>     assert_that(
>>>         given(the_dag)
>>>             .when(task('task_1'), succeeds())
>>>             .and_(task('task_2'), succeeds())
>>>             .then(task('task_3'), runs()))
>>> ```
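>>>
>>> The rules from the bullet list above could be written the same way. For
>>> example, the email-alert one might look roughly like this - a sketch
>>> only, since pattern matchers like any_task_matching() and a fails()
>>> outcome are not in the draft PR, so treat them as hypothetical:
>>>
>>> ```
>>> # Hypothetical sketch: any_task_matching() and fails() do not exist in
>>> # the draft yet; backup_dag is assumed to be the DAG under test.
>>> assert_that(
>>>     given(backup_dag)
>>>         .when(any_task_matching('database_backup_*'), fails())
>>>         .then(any_task_matching('email_alert_*'), runs()))
>>> ```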
>>>
>>> The first test is a very simple example - and it's not great, because it
>>> only duplicates the DAG logic - but you can see more examples in my
>>> draft PR[1] and in my draft AIP[2].
>>>
>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>> It's very close to what I have written here : )
>>>
>>> LMK what y'all think. I am also happy to publish this as a separate
>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>> -P.
>>>
>>> [1]
>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>> [2]
>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>
>>>
>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>
>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>>> wrote:
>>>> >
>>>> > Hi Pablo,
>>>> >
>>>> > Could you describe at a high level what you are thinking of? It's
>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>> significant enough to need an AIP.
>>>> >
>>>> > Thanks,
>>>> > Ash
>>>> >
>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>> <[email protected]> wrote:
>>>> >>
>>>> >> Hi there!
>>>> >> I would like to start a discussion of an idea that I had for a
>>>> testing framework for airflow.
>>>> >> I believe the first step would be to write up an AIP - so could I
>>>> have access to write a new one on the cwiki?
>>>> >>
>>>> >> Thanks!
>>>> >> -P.
>>>>
>>>
