That was indeed a long one.  I think I like where you are going with it and it 
feels like a nice step if the goal is to consider a full break.  I also think 
consolidating everything related to a given provider in one place like that 
will make it much more intuitive for contributors.

I have two immediate thoughts on it.   First, can this be used as an 
opportunity to reorganize the system tests tree into the provider's test tree?  
Instead of having `tests/system/providers/{foo}`, maybe we can move those 
system tests alongside the other provider tests like this:


[cid:cd7da5a0-c387-47cb-bb5f-a593baa4f757]


I'm not sure if that would mess with how pytest runs the system tests, but 
maybe it's something that can be done.


My other thought is that I think I am missing something here as far as the new 
organization goes.  I haven't spent the time that you have spent looking at how 
to make this work, and I also realize this is early PoC discussion so maybe 
this isn't the end target, but the new paths are very long and redundant.  Is 
that for automation reasons?  For example in the image you shared:

`airflow/providers/airbyte/hooks/airbyte.py` becomes 
`airflow/providers/airbyte/src/airflow/providers/airbyte/hooks/airbyte.py`
`tests/providers/airbyte/hooks/test_airbyte.py` becomes 
`airflow/providers/airbyte/tests/airflow/providers/airbyte/hooks/test_airbyte.py.

If that redundancy can be removed, that would be best.  /src/ basically maps to 
package root, and /tests/ is also a top-level directory/module so I'm guessing 
the plan is just to copy those directory trees over into the Airflow project 
tree using a script?


- Dennis


________________________________
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Sunday, December 11, 2022 5:25 PM
To: dev@airflow.apache.org
Subject: [EXTERNAL] [PROPOSAL] (Internal) move of provider packages to isolated 
"providers" sub-folders


CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.


Hello everyone,

TL;DR: I have a proposal to reorganise the way how our providers are kept in 
our sources to reflect more the standard ways of packaging of the providers. 
It's a REALLY long one, so If you are not ready to deep dive in discussion on 
the structure of the airflow project, you might want to skip that one :).


NOTE. This is not (yet) a discussion about separating the providers out 
completely from "airflow" repo. This discussion is yet to happen in the future 
and it has much more "social" than "technology" aspects and this discussion is 
something we know will  come.

But before we get to "split providers in repo" discussion, I would like (and I 
also know few other people think in a similar direction - for sure Ash and TP 
that we have at least few - including some recent chats with) how our provider 
packages are "packaged" inside our repo.

Actually once we succeed "interna" separation of providers  - each provider 
will be much more "sandboxed" and if we decide to split them out to a separate 
repo (or repos) it will be much easier. Or maybe we get to the conclusion that 
such separation inside a mono-repo is "good enough" and we decide not to split 
to separate repos (which is also a possibility).

CURRENT STATE:

The current state is that "airflow.providers.*" source folders live under the 
"airflow" package in our sources. This historically comes from 1.10 and the way 
we back-ported old providers and has some good and bad properties - the good 
property is that you can do `pip install -e .` and you can develop both 
providers and core together so setup for anyone who contributes to providers is 
easy

But this has also the side-effect that we cannot use super-standard approach 
and standards for packaging defined in modern Python world and PyPA  - namely 
those two PEPs: https://peps.python.org/pep-0517/ and 
https://peps.python.org/pep-0621/ . Our build system for providers is pretty 
convoluted as it requires dynamic generation of setup.py and running 
setuptools. It works (for a few years now) - but for example if someone - not 
release manager - wants to build the package, has to use "breeze" to generate 
the sources and run the build.
PEP 517 – A build-system independent format for source trees | 
peps.python.org<https://peps.python.org/pep-0517/>
peps.python.org
Python Enhancement Proposals (PEPs)

PEP 621 – Storing project metadata in pyproject.toml | 
peps.python.org<https://peps.python.org/pep-0621/>
peps.python.org
Python Enhancement Proposals (PEPs)


Also part of the problem is that provider - related stuff is scattered all-over 
the repo in a few places:

* airflow/providers/
* tests/providers
* tests/integration/providers
* tests/system/providers/
* docs/apache-airflow-providers*

Likely some other places too that we will find out :).

Over the last couple of months I did small, incremental refactoring and 
restructuring of our build system to make it easier, but finally after 
separating out integration tests, we should be ready to make a "big" refactor 
to restructure the packages.

I think we are very close to being able to make it in the way that it will be 
just "moving" the files around without any changes to them (maybe some in 
tests), so that's why I made a POC and wanted to discuss it before investing 
more time in making it a reality.

PROPOSAL:

What I would like to achieve is to:

* have each provider to have a separate, complete, standard structure of its 
folder
* use non-setuptools based builder that only uses pyproject.toml (for example 
flit).
* I would like to keep the "easiness" of development - both with breeze and 
local venv for people who are developing providers (especially when they want 
to implement changes in several providers at the same time).
* for people using IDEs such as VSCode or INtelliJ or Codespaces or whatever, 
it should be straightforward how to work on the common codebase of Airflow and 
providers. Actually this is the most difficult part of it.

RESULT:

I developed a fully automated POC of a script that can convert all providers in 
one go:

* https://github.com/apache/airflow/pull/28291
[https://opengraph.githubassets.com/64f0a96a5194a8f54a6144bb49ff10901ba687cf6d86179c3b616a03cd9e1b23/apache/airflow/pull/28291]<https://github.com/apache/airflow/pull/28291>

Add script to move providers to a new directory structure by potiuk · Pull 
Request #28291 · apache/airflow · 
GitHub<https://github.com/apache/airflow/pull/28291>
github.com
^ Add meaningful description above Read the Pull Request Guidelines for more 
information. In case of fundamental code changes, an Airflow Improvement 
Proposal (AIP) is needed. In case of a new dependency, check compliance with 
the ASF 3rd Party License Policy. In case of backwards incompatible changes 
please leave a note in a newsfragment file, named {pr_number}.significant.rst 
or {issue_number}.significant.rst, in newsfragments.


Here is also how the repo might look like after I applied the script:

* https://github.com/apache/airflow/pull/28292
[https://opengraph.githubassets.com/222611e3057422172e2fb33e3c93d2f3968610eed959a6294ffedc38d6ed490f/apache/airflow/pull/28292]<https://github.com/apache/airflow/pull/28292>

Draft POC attempt to make a better structure for providers by potiuk · Pull 
Request #28292 · apache/airflow · 
GitHub<https://github.com/apache/airflow/pull/28292>
github.com
This is a draft attempt showing the new structure of provider package if we 
separte them out to separate provider directory. This is still not a complete 
solution - we need to work out a few more things regarding docs building and 
breeze integration and package importing so that everything works directly from 
the sources, but it should be good to start discussion about the move. ^ Add 
meaningful description above Read the Pull Request Guidelines for more 
information. In case of fundamental code changes, an Airflow Improvement 
Proposal (AIP) is needed. In case of a new dependency, check compliance with 
the ASF 3rd Party License Policy. In case of backwards incompatible changes 
please leave a note in a newsfragment file, named {pr_number}.significant.rst 
or {issue_number}.significant.rst, in newsfragments.


You can check it out and see the new structure in place for all providers.

I also attach the image of the structure I came up with (also see the link here 
if you cannot see attachments:  
https://gcdnb.pbrd.co/images/R0RtjHbnnqNw.png?o=1

WHAT's DONE:

* providers are fully isolated from each other in a separate, standalone 
project inside our repo.
* they have a common structure:

provider/<PROVIDER>
   src/
        airflow/providers/<PROVIDER>/
                                    operators/
                                    hooks/
   docs/
   tests/
        airflow/providers/<PROVIDER>/
                                    operators/
                                    hooks/
   tests/system/
               airflow/providers/<PROVIDER>/
               ...
   tests/integration
   pyproject.toml
   INSTALL.txt
   README.rst
   provider.yaml
   ...


* all of the files that are needed to build a package or develop are committed 
together with each provider - turning the "providers/<PROVIDER>" into a 
complete, separate, closed project. This means that we have quite some 
duplication in the project, but those files can be also re-generated by 
`pre-commits` as needed as they are automatically generated in most cases.

* there is no setup.py, setup.cfg any more, there is only a pyproject.toml. 
Theoretically any compliant PEP 517/621 builder could be used - (almost - you 
have to actually choose the tool and make some requirements and not all build 
tools support everything). I chose flit as a tool to integrate with and it 
works rather nicely. But I am happy to discuss (especially with TP) other 
choices we might make.

* we can now automatically build all the packages - I modified our release 
tools to use `flit` and it works nicely (and you can also do it now easily 
manually too if you do not want to use breeze). I also compared few generated 
packages and we are close to say we have 1-1 replacement (there are likely some 
embedded data files that need to be more verified/fixed)

* Note that this is purely development-related change. If we make all the 
packages contain the same files or very similar ones, there should be 
pretty-much 0 impact on production installation of airflow and providers..

WHAT'S LEFT:

* CI /tests do not work as we need a bit more work to make sure those packages 
from separate projects are importable in CI directly from sources without 
building and installing the packages and making sure that all tests work. It 
might require some import trickery though because airflow and airflow.providers 
packages overlap.

*  The local development "easiness" is not yet addressed. Those projects are 
separate from the main airflow. I think this is the most difficult part of the 
move actually - to make sure that any new or even existing contributor can 
start developing any change to any provider very easily without heavy IDE 
configuration. Maybe even it will be difficult to achieve it in some cases (for 
example in standard community PyCharm or VSCode or even using Codespaces (that 
is available now for free in a capacity that is good for casual users). But I 
will invest quite some time to make it as straightforward as possible and 
"natural" for sure. For me this is an absolute prerequisite - to make sure 
contributions to Airflow and Providers remain easy for first-time users.

* Documentation building does not yet work. restructuring the docs to separate 
folders in providers might require some Sphinx trickery - I want to simply make 
it possible that you can build the doc of each provider separately, making them 
fully standalone packages.

* None of the choices I made are set in stone. I proposed the package structure 
as in the attached picture, but I am happy to discuss pros and cons of 
different approaches. This refactor is fully automated with the Python script I 
created, so we can modify it and update until we decide it's ready. There are 
few choices, which I am not sure about.

* I think - if we decide this is a good move - we should make it a "clean-cut" 
and migrate all providers at once. The change is very invasive in our tooling - 
they depend on the structure to be in place, so it would be terribly 
complicated to keep our CI and dev-env to support both approaches in parallel. 
But once we solve the "development easiness", this will also be a very "easy" 
move. Most of the files will be simply moved and not modified, which will mean 
that Git will keep track of them, it will break a number of open PRs, but it 
has no impact on "airflow" cherry-picking because all the changes are in 
"providers" not in core airflow.

MY ASKS:

If you got that far - congratulations :D. It was really a long one. I wanted us 
to discuss some questions:

* Do you like that idea?
* Do you have any concerns?
* Any comments or proposals w/regards to structure/tools?

J.

Reply via email to