I had some thoughts about it - this also connects with recent discussions
about mixed governance for providers, and I think it's worth using this
discussion to set some rules and "boundaries" on when, how and
especially why we want to accept some contributions, while other
contributions are better off living outside.

We are about to start thinking (and discussing) more seriously about how to
split Airflow providers off Airflow. And I think we can split off more than
providers - this might be a good candidate for a standalone, but still
community-maintained, package. If we are going to solve the problem of
splitting Airflow into N packages, one more package does not matter.
And it would nicely solve "version independence". We could even make it
Airflow 2.0+ compliant if we want.

So I think the question of "is it tied to a specific Airflow version or
not" does not really prevent us from making it part of the community -
those two are not related (if we are going to have more repositories
anyway).

The important part is really how "self-servicing" we can make it, how we
make sure it stays relevant for future versions of Airflow, and who does
that - namely, who has the incentive and "responsibility" to maintain it.
I am sure we will add more features to Airflow DAGs and simplify the way
DAGs are written over time, and the test harness will have to adapt to that.

There are pros and cons of having such a standalone package "in the
community/ASF project" and "out of it". We have a good example (of a
similar kind of tool/util) from the past that we can learn from (and maybe
Bas can share more insights):

https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs

Initially that was "sponsored" by GoDataDriven, where Bas worked, and I
think this is where it was born. And that made sense, as it was likely also
useful for the customers of GoDataDriven (here I am guessing). But
apparently GoDataDriven's incentives wound down and it turned out that its
usefulness was not as big (also, I think we all in the Python community
learned that Pylint is more of a distraction than real help - we dumped
Pylint eventually). The plugin was not maintained beyond some versions of
1.10, and the tool is all but defunct now. Which is perfectly understandable.

In this case there is (I think) no risk of a "pylint"-like problem, but the
question of maintenance and adaptation to future versions of Airflow
remains.

I think there is one big difference between something that is "in ASF
repos" and "out":

* if we make it a standalone package in the "ASF Airflow community" - we
will have some obligation, and expectations from our users, to maintain it.
We can add a test harness (regardless of whether it lives in the airflow
repository or in a separate one) to make sure that new Airflow "core"
changes will not break it, and we can fail our PRs if they do - basically
making "core" maintainers take care of this problem rather than delegating
it to someone else who has to react to core changes (this is what has to
happen with providers, I believe, even if we split them into a separate
repo). I think anything that we as the ASF community release should have
such harnesses - making sure that whatever we release and make available to
our users works together (a rough sketch of such a harness check follows
after these two points).

* if it is outside of the "ASF community", someone will have to react to
"core airflow" changes. We will not do it in the community, we will not pay
attention, and such an "external tool" might break at any time because we
introduced a change in a part of the core that the external tool implicitly
relied on.
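
To make the harness idea concrete, here is a rough sketch of what such a
check could look like in core CI. This is only an illustration - the
standalone package name and its entry point below are made up:

```
# Hypothetical test living in core Airflow CI. The package name
# "airflow_dag_test_harness" and its run_self_checks() entry point are
# illustrative only - nothing like this exists today.
import pytest


def test_standalone_harness_against_current_core():
    # Skip (rather than fail) when the package is not installed, e.g. in
    # CI jobs that do not exercise cross-package compatibility checks.
    harness = pytest.importorskip("airflow_dag_test_harness")

    # If a core change breaks the harness, the core PR fails right here -
    # core maintainers see the breakage immediately, instead of users
    # discovering it after a release.
    assert harness.run_self_checks() == 0
```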

For me, the question of whether something should be in or out should be
based on:

* is it really useful for the community as a whole? -> if yes, we should
consider it
* is it strongly tied to the core of Airflow, in the sense of relying on
some internals that might change easily? -> if not, there is no need to
bring it in; it can easily be maintained outside by anyone
* if it is strongly tied to the core -> is there someone (a person, an
organisation) who wants to take on the burden of maintaining it and has an
incentive to do so for quite some time? -> if yes, great, let them do
that!
* if it is strongly tied, do we want to take on the burden as "core
airflow maintainers" of keeping it updated together with the core? -> if
yes, we should bring it in

If we have a strongly tied tool that we do not want to maintain in the core
and there is no entity that would like to do it, then I think this idea
should be dropped :).

J.


On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:

> Hi Pablo,
>
> Wow, I really love this idea. This will greatly enrich the airflow
> ecosystem.
>
> I agree with Ash, it is better to have it as a standalone package. And we
> can use this framework to write Airflow core invariant tests, so that we
> can run them on every Airflow release to guarantee no regressions.
>
> Thanks,
>
> Ping
>
>
> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]>
> wrote:
>
>> Understood!
>>
>> TL;DR: I propose a testing framework where users can check for 'DAG
>> execution invariants' or 'DAG execution expectations' given certain task
>> outcomes.
>>
>> As DAGs grow in complexity, it can become difficult to reason about
>> their runtime behavior across all the scenarios they may hit. Users may
>> want to lay out rules in the form of tests that can verify DAG execution
>> results. For example:
>>
>> - If any of my database_backup_* tasks fails, I want to ensure that at
>> least one email_alert_* task will run.
>> - If my 'check_authentication' task fails, I want to ensure that the
>> whole DAG will fail.
>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>> PubsubOperator downstream will always run.
>>
>> These sorts of invariants don't need the DAG to actually be executed;
>> yet they are pretty hard to test today: staging environments can't check
>> every possible runtime outcome.
>>
>> In this framework, users would define unit tests like this:
>>
>> ```
>> from airflow import models
>> from airflow.operators.empty import EmptyOperator
>>
>> # assert_that, given, task, succeeds and runs would come from the
>> # proposed testing framework; DEFAULT_DATE is a placeholder start date.
>>
>> def test_my_example_dag():
>>     the_dag = models.DAG(
>>         'the_basic_dag',
>>         schedule_interval='@daily',
>>         start_date=DEFAULT_DATE,
>>     )
>>
>>     with the_dag:
>>         op1 = EmptyOperator(task_id='task_1')
>>         op2 = EmptyOperator(task_id='task_2')
>>         op3 = EmptyOperator(task_id='task_3')
>>
>>         op1 >> op2 >> op3
>>
>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>     # always run.
>>     assert_that(
>>         given(the_dag)
>>             .when(task('task_1'), succeeds())
>>             .and_(task('task_2'), succeeds())
>>             .then(task('task_3'), runs()))
>> ```
>>
>> This is a very simple example - and it's not great, because it only
>> duplicates the DAG logic - but you can see more examples in my draft PR
>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>> and in my draft AIP
>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>> [2].
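>>
>> For the 'check_authentication' example from the list above, a test might
>> read something like this (just a sketch - the fails() and dag_run()
>> matchers are hypothetical, written in the style of the draft API, and
>> 'the_dag' is assumed to contain a 'check_authentication' task):
>>
>> ```
>> def test_auth_failure_fails_whole_dag():
>>     # Invariant: if check_authentication fails, the whole DAG run fails.
>>     assert_that(
>>         given(the_dag)
>>             .when(task('check_authentication'), fails())
>>             .then(dag_run(), fails()))
>> ```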
>>
>> I started writing up an AIP in a Google doc[2] which y'all can check.
>> It's very close to what I have written here : )
>>
>> LMK what y'all think. I am also happy to publish this as a separate
>> library if y'all wanna be cautious about adding it directly to Airflow.
>> -P.
>>
>> [1]
>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>> [2]
>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>
>>
>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>
>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>
>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Could you describe at a high level what you are thinking of? It's
>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>> significant enough to need an AIP.
>>> >
>>> > Thanks,
>>> > Ash
>>> >
>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi there!
>>> >> I would like to start a discussion of an idea that I had for a
>>> testing framework for airflow.
>>> >> I believe the first step would be to write up an AIP - so could I
>>> have access to write a new one on the cwiki?
>>> >>
>>> >> Thanks!
>>> >> -P.
>>>
>>
