Hello everyone,

TL;DR: I have a proposal to reorganise the way how our providers are kept
in our sources to reflect more the standard ways of packaging of the
providers. It's a REALLY long one, so If you are not ready to deep dive in
discussion on the structure of the airflow project, you might want to skip
that one :).


NOTE. This is not (yet) a discussion about separating the providers out
completely from "airflow" repo. This discussion is yet to happen in the
future and it has much more "social" than "technology" aspects and this
discussion is something we know will  come.

But before we get to "split providers in repo" discussion, I would like
(and I also know few other people think in a similar direction - for sure
Ash and TP that we have at least few - including some recent chats with)
how our provider packages are "packaged" inside our repo.

Actually once we succeed "interna" separation of providers  - each provider
will be much more "sandboxed" and if we decide to split them out to a
separate repo (or repos) it will be much easier. Or maybe we get to the
conclusion that such separation inside a mono-repo is "good enough" and we
decide not to split to separate repos (which is also a possibility).

CURRENT STATE:

The current state is that "airflow.providers.*" source folders live under
the "airflow" package in our sources. This historically comes from 1.10 and
the way we back-ported old providers and has some good and bad properties -
the good property is that you can do `pip install -e .` and you can develop
both providers and core together so setup for anyone who contributes to
providers is easy

But this has also the side-effect that we cannot use super-standard
approach and standards for packaging defined in modern Python world and
PyPA  - namely those two PEPs: https://peps.python.org/pep-0517/ and
https://peps.python.org/pep-0621/ . Our build system for providers is
pretty convoluted as it requires dynamic generation of setup.py and running
setuptools. It works (for a few years now) - but for example if someone -
not release manager - wants to build the package, has to use "breeze" to
generate the sources and run the build.

Also part of the problem is that provider - related stuff is scattered
all-over the repo in a few places:

* airflow/providers/
* tests/providers
* tests/integration/providers
* tests/system/providers/
* docs/apache-airflow-providers*

Likely some other places too that we will find out :).

Over the last couple of months I did small, incremental refactoring and
restructuring of our build system to make it easier, but finally after
separating out integration tests, we should be ready to make a "big"
refactor to restructure the packages.

I think we are very close to being able to make it in the way that it will
be just "moving" the files around without any changes to them (maybe some
in tests), so that's why I made a POC and wanted to discuss it before
investing more time in making it a reality.

PROPOSAL:

What I would like to achieve is to:

* have each provider to have a separate, complete, standard structure of
its folder
* use non-setuptools based builder that only uses pyproject.toml (for
example flit).
* I would like to keep the "easiness" of development - both with breeze and
local venv for people who are developing providers (especially when they
want to implement changes in several providers at the same time).
* for people using IDEs such as VSCode or INtelliJ or Codespaces or
whatever, it should be straightforward how to work on the common codebase
of Airflow and providers. Actually this is the most difficult part of it.

RESULT:

I developed a fully automated POC of a script that can convert all
providers in one go:

* https://github.com/apache/airflow/pull/28291

Here is also how the repo might look like after I applied the script:

* https://github.com/apache/airflow/pull/28292

You can check it out and see the new structure in place for all providers.

I also attach the image of the structure I came up with (also see the link
here if you cannot see attachments:
https://gcdnb.pbrd.co/images/R0RtjHbnnqNw.png?o=1

WHAT's DONE:

* providers are fully isolated from each other in a separate, standalone
project inside our repo.
* they have a common structure:

provider/<PROVIDER>
   src/
        airflow/providers/<PROVIDER>/
                                    operators/
                                    hooks/
   docs/
   tests/
        airflow/providers/<PROVIDER>/
                                    operators/
                                    hooks/
   tests/system/
               airflow/providers/<PROVIDER>/
               ...
   tests/integration
   pyproject.toml
   INSTALL.txt
   README.rst
   provider.yaml
   ...


* all of the files that are needed to build a package or develop are
committed together with each provider - turning the "providers/<PROVIDER>"
into a complete, separate, closed project. This means that we have quite
some duplication in the project, but those files can be also re-generated
by `pre-commits` as needed as they are automatically generated in most
cases.

* there is no setup.py, setup.cfg any more, there is only a pyproject.toml.
Theoretically any compliant PEP 517/621 builder could be used - (almost -
you have to actually choose the tool and make some requirements and not all
build tools support everything). I chose flit as a tool to integrate with
and it works rather nicely. But I am happy to discuss (especially with TP)
other choices we might make.

* we can now automatically build all the packages - I modified our release
tools to use `flit` and it works nicely (and you can also do it now easily
manually too if you do not want to use breeze). I also compared few
generated packages and we are close to say we have 1-1 replacement (there
are likely some embedded data files that need to be more verified/fixed)

* Note that this is purely development-related change. If we make all the
packages contain the same files or very similar ones, there should be
pretty-much 0 impact on production installation of airflow and providers..

WHAT'S LEFT:

* CI /tests do not work as we need a bit more work to make sure those
packages from separate projects are importable in CI directly from sources
without building and installing the packages and making sure that all tests
work. It might require some import trickery though because airflow and
airflow.providers packages overlap.

*  The local development "easiness" is not yet addressed. Those projects
are separate from the main airflow. I think this is the most difficult part
of the move actually - to make sure that any new or even existing
contributor can start developing any change to any provider very easily
without heavy IDE configuration. Maybe even it will be difficult to achieve
it in some cases (for example in standard community PyCharm or VSCode or
even using Codespaces (that is available now for free in a capacity that is
good for casual users). But I will invest quite some time to make it as
straightforward as possible and "natural" for sure. For me this is an
absolute prerequisite - to make sure contributions to Airflow and Providers
remain easy for first-time users.

* Documentation building does not yet work. restructuring the docs to
separate folders in providers might require some Sphinx trickery - I want
to simply make it possible that you can build the doc of each provider
separately, making them fully standalone packages.

* None of the choices I made are set in stone. I proposed the package
structure as in the attached picture, but I am happy to discuss pros and
cons of different approaches. This refactor is fully automated with the
Python script I created, so we can modify it and update until we decide
it's ready. There are few choices, which I am not sure about.

* I think - if we decide this is a good move - we should make it a
"clean-cut" and migrate all providers at once. The change is very invasive
in our tooling - they depend on the structure to be in place, so it would
be terribly complicated to keep our CI and dev-env to support both
approaches in parallel. But once we solve the "development easiness", this
will also be a very "easy" move. Most of the files will be simply moved and
not modified, which will mean that Git will keep track of them, it will
break a number of open PRs, but it has no impact on "airflow"
cherry-picking because all the changes are in "providers" not in core
airflow.

MY ASKS:

If you got that far - congratulations :D. It was really a long one. I
wanted us to discuss some questions:

* Do you like that idea?
* Do you have any concerns?
* Any comments or proposals w/regards to structure/tools?

J.

Reply via email to