That was indeed a long one. I think I like where you are going with it and it feels like a nice step if the goal is to consider a full break. I also think consolidating everything related to a given provider in one place like that will make it much more intuitive for contributors.
I have two immediate thoughts on it. First, can this be used as an opportunity to reorganize the system tests tree into the provider's test tree? Instead of having `tests/system/providers/{foo}`, maybe we can move those system tests alongside the other provider tests like this: [cid:cd7da5a0-c387-47cb-bb5f-a593baa4f757] I'm not sure if that would mess with how pytest runs the system tests, but maybe it's something that can be done. My other thought is that I think I am missing something here as far as the new organization goes. I haven't spent the time that you have spent looking at how to make this work, and I also realize this is early PoC discussion so maybe this isn't the end target, but the new paths are very long and redundant. Is that for automation reasons? For example in the image you shared: `airflow/providers/airbyte/hooks/airbyte.py` becomes `airflow/providers/airbyte/src/airflow/providers/airbyte/hooks/airbyte.py` `tests/providers/airbyte/hooks/test_airbyte.py` becomes `airflow/providers/airbyte/tests/airflow/providers/airbyte/hooks/test_airbyte.py. If that redundancy can be removed, that would be best. /src/ basically maps to package root, and /tests/ is also a top-level directory/module so I'm guessing the plan is just to copy those directory trees over into the Airflow project tree using a script? - Dennis ________________________________ From: Jarek Potiuk <ja...@potiuk.com> Sent: Sunday, December 11, 2022 5:25 PM To: dev@airflow.apache.org Subject: [EXTERNAL] [PROPOSAL] (Internal) move of provider packages to isolated "providers" sub-folders CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hello everyone, TL;DR: I have a proposal to reorganise the way how our providers are kept in our sources to reflect more the standard ways of packaging of the providers. It's a REALLY long one, so If you are not ready to deep dive in discussion on the structure of the airflow project, you might want to skip that one :). NOTE. This is not (yet) a discussion about separating the providers out completely from "airflow" repo. This discussion is yet to happen in the future and it has much more "social" than "technology" aspects and this discussion is something we know will come. But before we get to "split providers in repo" discussion, I would like (and I also know few other people think in a similar direction - for sure Ash and TP that we have at least few - including some recent chats with) how our provider packages are "packaged" inside our repo. Actually once we succeed "interna" separation of providers - each provider will be much more "sandboxed" and if we decide to split them out to a separate repo (or repos) it will be much easier. Or maybe we get to the conclusion that such separation inside a mono-repo is "good enough" and we decide not to split to separate repos (which is also a possibility). CURRENT STATE: The current state is that "airflow.providers.*" source folders live under the "airflow" package in our sources. This historically comes from 1.10 and the way we back-ported old providers and has some good and bad properties - the good property is that you can do `pip install -e .` and you can develop both providers and core together so setup for anyone who contributes to providers is easy But this has also the side-effect that we cannot use super-standard approach and standards for packaging defined in modern Python world and PyPA - namely those two PEPs: https://peps.python.org/pep-0517/ and https://peps.python.org/pep-0621/ . Our build system for providers is pretty convoluted as it requires dynamic generation of setup.py and running setuptools. It works (for a few years now) - but for example if someone - not release manager - wants to build the package, has to use "breeze" to generate the sources and run the build. PEP 517 – A build-system independent format for source trees | peps.python.org<https://peps.python.org/pep-0517/> peps.python.org Python Enhancement Proposals (PEPs) PEP 621 – Storing project metadata in pyproject.toml | peps.python.org<https://peps.python.org/pep-0621/> peps.python.org Python Enhancement Proposals (PEPs) Also part of the problem is that provider - related stuff is scattered all-over the repo in a few places: * airflow/providers/ * tests/providers * tests/integration/providers * tests/system/providers/ * docs/apache-airflow-providers* Likely some other places too that we will find out :). Over the last couple of months I did small, incremental refactoring and restructuring of our build system to make it easier, but finally after separating out integration tests, we should be ready to make a "big" refactor to restructure the packages. I think we are very close to being able to make it in the way that it will be just "moving" the files around without any changes to them (maybe some in tests), so that's why I made a POC and wanted to discuss it before investing more time in making it a reality. PROPOSAL: What I would like to achieve is to: * have each provider to have a separate, complete, standard structure of its folder * use non-setuptools based builder that only uses pyproject.toml (for example flit). * I would like to keep the "easiness" of development - both with breeze and local venv for people who are developing providers (especially when they want to implement changes in several providers at the same time). * for people using IDEs such as VSCode or INtelliJ or Codespaces or whatever, it should be straightforward how to work on the common codebase of Airflow and providers. Actually this is the most difficult part of it. RESULT: I developed a fully automated POC of a script that can convert all providers in one go: * https://github.com/apache/airflow/pull/28291 [https://opengraph.githubassets.com/64f0a96a5194a8f54a6144bb49ff10901ba687cf6d86179c3b616a03cd9e1b23/apache/airflow/pull/28291]<https://github.com/apache/airflow/pull/28291> Add script to move providers to a new directory structure by potiuk · Pull Request #28291 · apache/airflow · GitHub<https://github.com/apache/airflow/pull/28291> github.com ^ Add meaningful description above Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments. Here is also how the repo might look like after I applied the script: * https://github.com/apache/airflow/pull/28292 [https://opengraph.githubassets.com/222611e3057422172e2fb33e3c93d2f3968610eed959a6294ffedc38d6ed490f/apache/airflow/pull/28292]<https://github.com/apache/airflow/pull/28292> Draft POC attempt to make a better structure for providers by potiuk · Pull Request #28292 · apache/airflow · GitHub<https://github.com/apache/airflow/pull/28292> github.com This is a draft attempt showing the new structure of provider package if we separte them out to separate provider directory. This is still not a complete solution - we need to work out a few more things regarding docs building and breeze integration and package importing so that everything works directly from the sources, but it should be good to start discussion about the move. ^ Add meaningful description above Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments. You can check it out and see the new structure in place for all providers. I also attach the image of the structure I came up with (also see the link here if you cannot see attachments: https://gcdnb.pbrd.co/images/R0RtjHbnnqNw.png?o=1 WHAT's DONE: * providers are fully isolated from each other in a separate, standalone project inside our repo. * they have a common structure: provider/<PROVIDER> src/ airflow/providers/<PROVIDER>/ operators/ hooks/ docs/ tests/ airflow/providers/<PROVIDER>/ operators/ hooks/ tests/system/ airflow/providers/<PROVIDER>/ ... tests/integration pyproject.toml INSTALL.txt README.rst provider.yaml ... * all of the files that are needed to build a package or develop are committed together with each provider - turning the "providers/<PROVIDER>" into a complete, separate, closed project. This means that we have quite some duplication in the project, but those files can be also re-generated by `pre-commits` as needed as they are automatically generated in most cases. * there is no setup.py, setup.cfg any more, there is only a pyproject.toml. Theoretically any compliant PEP 517/621 builder could be used - (almost - you have to actually choose the tool and make some requirements and not all build tools support everything). I chose flit as a tool to integrate with and it works rather nicely. But I am happy to discuss (especially with TP) other choices we might make. * we can now automatically build all the packages - I modified our release tools to use `flit` and it works nicely (and you can also do it now easily manually too if you do not want to use breeze). I also compared few generated packages and we are close to say we have 1-1 replacement (there are likely some embedded data files that need to be more verified/fixed) * Note that this is purely development-related change. If we make all the packages contain the same files or very similar ones, there should be pretty-much 0 impact on production installation of airflow and providers.. WHAT'S LEFT: * CI /tests do not work as we need a bit more work to make sure those packages from separate projects are importable in CI directly from sources without building and installing the packages and making sure that all tests work. It might require some import trickery though because airflow and airflow.providers packages overlap. * The local development "easiness" is not yet addressed. Those projects are separate from the main airflow. I think this is the most difficult part of the move actually - to make sure that any new or even existing contributor can start developing any change to any provider very easily without heavy IDE configuration. Maybe even it will be difficult to achieve it in some cases (for example in standard community PyCharm or VSCode or even using Codespaces (that is available now for free in a capacity that is good for casual users). But I will invest quite some time to make it as straightforward as possible and "natural" for sure. For me this is an absolute prerequisite - to make sure contributions to Airflow and Providers remain easy for first-time users. * Documentation building does not yet work. restructuring the docs to separate folders in providers might require some Sphinx trickery - I want to simply make it possible that you can build the doc of each provider separately, making them fully standalone packages. * None of the choices I made are set in stone. I proposed the package structure as in the attached picture, but I am happy to discuss pros and cons of different approaches. This refactor is fully automated with the Python script I created, so we can modify it and update until we decide it's ready. There are few choices, which I am not sure about. * I think - if we decide this is a good move - we should make it a "clean-cut" and migrate all providers at once. The change is very invasive in our tooling - they depend on the structure to be in place, so it would be terribly complicated to keep our CI and dev-env to support both approaches in parallel. But once we solve the "development easiness", this will also be a very "easy" move. Most of the files will be simply moved and not modified, which will mean that Git will keep track of them, it will break a number of open PRs, but it has no impact on "airflow" cherry-picking because all the changes are in "providers" not in core airflow. MY ASKS: If you got that far - congratulations :D. It was really a long one. I wanted us to discuss some questions: * Do you like that idea? * Do you have any concerns? * Any comments or proposals w/regards to structure/tools? J.