Hello everyone,

TL;DR: I have a proposal to reorganise how our providers are kept in our sources, so that it better reflects the standard ways of packaging them. It's a REALLY long one, so if you are not ready for a deep dive into the structure of the Airflow project, you might want to skip this one :).

NOTE: This is not (yet) a discussion about separating the providers out completely from the "airflow" repo. That discussion is yet to happen; it has many more "social" than "technology" aspects, and we know it will come. But before we get to the "split providers out of the repo" discussion, I would like to rework how our provider packages are "packaged" inside our repo (I know at least a few other people think in a similar direction, for sure Ash and TP, including from some recent chats). Actually, once we succeed with the "internal" separation of providers, each provider will be much more "sandboxed", and if we decide to split them out to a separate repo (or repos) it will be much easier. Or maybe we will conclude that such separation inside a mono-repo is "good enough" and decide not to split into separate repos (which is also a possibility).

CURRENT STATE:

The current state is that the "airflow.providers.*" source folders live under the "airflow" package in our sources. This comes historically from 1.10 and the way we back-ported old providers, and it has some good and bad properties. The good property is that you can do `pip install -e .` and develop both providers and core together, so setup for anyone who contributes to providers is easy. But it also has the side effect that we cannot use the standard packaging approach defined in the modern Python world and by PyPA, namely these two PEPs: https://peps.python.org/pep-0517/ and https://peps.python.org/pep-0621/ . Our build system for providers is pretty convoluted, as it requires dynamic generation of setup.py and running setuptools. It works (and has for a few years now), but if, for example, someone who is not a release manager wants to build a package, they have to use "breeze" to generate the sources and run the build.

Another part of the problem is that provider-related stuff is scattered all over the repo in a few places:

* airflow/providers/
* tests/providers
* tests/integration/providers
* tests/system/providers/
* docs/apache-airflow-providers*

There are likely some other places too that we will find out about :).

Over the last couple of months I did small, incremental refactorings and restructurings of our build system to make this easier, and finally, after separating out the integration tests, we should be ready for a "big" refactor to restructure the packages. I think we are very close to being able to do it in a way that just "moves" the files around without any changes to them (maybe some in tests), which is why I made a POC and wanted to discuss it before investing more time in making it a reality.

PROPOSAL:

What I would like to achieve is to:

* have each provider keep a separate, complete, standard structure in its own folder
* use a non-setuptools-based builder that only uses pyproject.toml (for example flit)
* keep the "easiness" of development, both with breeze and with a local venv, for people who are developing providers (especially when they want to implement changes in several providers at the same time)
* make it straightforward for people using IDEs such as VSCode, IntelliJ, Codespaces or whatever to work on the common codebase of Airflow and providers. Actually, this is the most difficult part.

RESULT:

I developed a fully automated POC of a script that can convert all providers in one go:

* https://github.com/apache/airflow/pull/28291

Here is also what the repo might look like after I applied the script:

* https://github.com/apache/airflow/pull/28292

You can check it out and see the new structure in place for all providers.

I also attach an image of the structure I came up with (see this link if you cannot see attachments: https://gcdnb.pbrd.co/images/R0RtjHbnnqNw.png?o=1).

WHAT'S DONE:

* Providers are fully isolated from each other, each in a separate, standalone project inside our repo.
* They have a common structure:

    provider/<PROVIDER>
        src/
            airflow/providers/<PROVIDER>/
                operators/
                hooks/
        docs/
        tests/
            airflow/providers/<PROVIDER>/
                operators/
                hooks/
        tests/system/
            airflow/providers/<PROVIDER>/
                ...
        tests/integration
        pyproject.toml
        INSTALL.txt
        README.rst
        provider.yaml
        ...

* All of the files that are needed to build a package or to develop are committed together with each provider, turning "providers/<PROVIDER>" into a complete, separate, closed project. This means that we have quite some duplication in the project, but those files can also be re-generated by `pre-commits` as needed, as they are automatically generated in most cases.
* There is no setup.py or setup.cfg any more; there is only a pyproject.toml. Theoretically any PEP 517/621-compliant builder could be used (almost: you still have to choose an actual tool and meet its requirements, and not all build tools support everything). I chose flit as the tool to integrate with and it works rather nicely, but I am happy to discuss (especially with TP) other choices we might make.
* We can now automatically build all the packages. I modified our release tools to use `flit` and it works nicely (and you can now also easily do it manually if you do not want to use breeze). I also compared a few generated packages and we are close to being able to say we have a 1-1 replacement (there are likely some embedded data files that need to be verified/fixed).
* Note that this is purely a development-related change. If we make all the packages contain the same files, or very similar ones, there should be pretty much 0 impact on production installations of airflow and providers.

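To illustrate the "just moving the files around" idea, here is a minimal sketch of how old paths could map onto the structure above. It is not the script from the PR. The names `LAYOUT_RULES` and `new_location`, the provider name "acme", and the choice of "providers" as the root folder are all assumptions made for illustration. It covers only the three source locations whose target sub-folder is shown in the structure, and it ignores providers nested more than one folder deep.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical rules: old repo prefix -> sub-folder inside the provider project.
# The more specific "tests/system/providers" prefix is listed before
# "tests/providers" so that it is tried first.
LAYOUT_RULES = [
    (("tests", "system", "providers"), ("tests", "system", "airflow", "providers")),
    (("tests", "providers"), ("tests", "airflow", "providers")),
    (("airflow", "providers"), ("src", "airflow", "providers")),
]


def new_location(old_path: str, root: str = "providers") -> PurePosixPath | None:
    """Return the new path of a provider file, or None if the file is not
    a provider file covered by LAYOUT_RULES (assumed layout, see lead-in)."""
    parts = PurePosixPath(old_path).parts
    for old_prefix, new_prefix in LAYOUT_RULES:
        n = len(old_prefix)
        # We need the prefix, a provider folder and at least one more component.
        if parts[:n] == old_prefix and len(parts) > n + 1:
            provider = parts[n]
            # The file keeps its relative path below the provider folder.
            return PurePosixPath(root, provider, *new_prefix, *parts[n:])
    return None


if __name__ == "__main__":
    for path in (
        "airflow/providers/acme/hooks/acme.py",
        "tests/providers/acme/hooks/test_acme.py",
        "tests/system/providers/acme/example_acme.py",
        "airflow/models/dag.py",
    ):
        print(path, "->", new_location(path))
```

Because the file contents are untouched and only the prefix changes, a move like this can be done with plain renames, which is what lets Git keep tracking the files.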
WHAT'S LEFT:

* CI/tests do not work yet. We need a bit more work to make sure that the packages from the separate projects are importable in CI directly from sources, without building and installing the packages, and that all tests work. It might require some import trickery though, because the airflow and airflow.providers packages overlap.
* The local development "easiness" is not yet addressed. Those projects are separate from the main airflow. I think this is actually the most difficult part of the move: making sure that any new or existing contributor can start developing any change to any provider very easily, without heavy IDE configuration. It may even be difficult to achieve in some cases (for example in the standard community PyCharm, in VSCode, or even in Codespaces, which is now available for free in a capacity that is good for casual users). But I will certainly invest quite some time to make it as straightforward and "natural" as possible. For me this is an absolute prerequisite: contributions to Airflow and providers must remain easy for first-time contributors.
* Documentation building does not yet work. Restructuring the docs into separate folders in providers might require some Sphinx trickery. I simply want to make it possible to build the docs of each provider separately, making them fully standalone packages.
* None of the choices I made are set in stone. I proposed the package structure shown in the attached picture, but I am happy to discuss the pros and cons of different approaches. This refactor is fully automated with the Python script I created, so we can modify it and re-run it until we decide it's ready. There are a few choices I am not sure about.
* I think, if we decide this is a good move, we should make it a "clean cut" and migrate all providers at once.
  The change is very invasive for our tooling, which depends on the structure being in place, so it would be terribly complicated to keep our CI and dev-env supporting both approaches in parallel. But once we solve the "development easiness", this will also be a very "easy" move. Most of the files will simply be moved and not modified, which means that Git will keep track of them. It will break a number of open PRs, but it has no impact on "airflow" cherry-picking, because all the changes are in "providers", not in core airflow.

MY ASKS:

If you got this far - congratulations :D. It really was a long one. I would like us to discuss some questions:

* Do you like the idea?
* Do you have any concerns?
* Any comments or proposals with regard to structure/tools?

J.
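
P.S. On the "import trickery" mentioned under WHAT'S LEFT: one standard-library mechanism for making a package whose sub-packages live in several source folders importable without installing anything is `pkgutil.extend_path`. The sketch below is only an illustration under assumptions, not what the POC does. The package name `demo_core` (standing in for "airflow"), the provider name "acme" and the helper names `write` and `demo` are hypothetical, and the two source folders are created in a temporary directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Placed in an __init__.py, this line asks Python to also look for the package's
# sub-packages in same-named folders under the other search-path entries.
EXTEND = "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"


def write(path: Path, text: str) -> None:
    """Create parent folders as needed and write a small source file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)


def demo() -> str:
    """Import a provider sub-package that lives outside the core source tree."""
    with tempfile.TemporaryDirectory() as tmp:
        core_src = Path(tmp, "core")
        provider_src = Path(tmp, "providers", "acme", "src")

        # The "core" project owns the top-level package and its providers package.
        write(core_src / "demo_core" / "__init__.py", EXTEND)
        write(core_src / "demo_core" / "providers" / "__init__.py", EXTEND)
        # The provider project ships only its own sub-package, in its own folder.
        write(
            provider_src / "demo_core" / "providers" / "acme" / "__init__.py",
            "NAME = 'acme'\n",
        )

        # Both source roots go on sys.path; nothing is built or installed.
        sys.path[:0] = [str(core_src), str(provider_src)]
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_core.providers.acme")
            return module.NAME
        finally:
            # Leave the interpreter as we found it.
            for entry in (str(core_src), str(provider_src)):
                sys.path.remove(entry)
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_core"]:
                del sys.modules[name]


if __name__ == "__main__":
    print(demo())
```

Whether this or some other approach (for example editable installs of each provider project) fits our CI and IDE setups best is exactly the kind of thing I would like to discuss.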