>> 1) I love the separation and the subproject structure overall. Especially with docs in there it should help speed up our docs build on PRs lot by not needing to build all docs
Not yet for docs. Docs still need "airflow" level `breeze build-docs` but it's been always been possible to **just** run docs per provider `breeze build-docs apache.beam` for example - and it's even been always explained as output of the `breeze build-docs` command starting from yellow "still tired of waiting for documentation to be built?". But yes - ideally building docs from provider package directory should only build that package. That one is coming :). [image: Screenshot 2025-02-15 at 22.47.53.png] On Sat, Feb 15, 2025 at 10:43 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > 3) I don’t love the `tests/provider_tests` (I know why we did it) so > how about `tests/unit` instead? It makes it clearer what it is (cos > everything under `tests/` are provider tests so it’s a bit confusing at > first to see provider tests under a provider folder. > > Yep. Unit is more consistent with `integration` and `system` and equally > unlikely to be "taken" by a 3rd-party package. So if others are for it, I > am happy to change it this way. > > > One question that wasn’t clear to me (maybe it’s not 100% relevant) is > which of these folders will and won’t have init.py’s in them? > > Yeah. It's not entirely clear. I have explained that a bit in a comment in > the PR - but might be worth to copy it here. It's really derivative of the > __init__.py in airflow and likely a complete separate discussion. > > Currently all packages in tests have __init__,.py. - but in some cases > they are namespace packages (when they appear in more providers) and in > some cases they are just "licence only __init__.py", but likely - if we get > rid of the "legacy namespace package" from airflow (i.e. remove completely > airflow.__init__.py) - we could also remove all the `__init__.py` files > from all test code. > > More explanation here: > https://github.com/apache/airflow/pull/46608#discussion_r1957137894 > > And I copy it here: > > Yes, it is needed, because with this change we have the same "packages" > in different places in source tree. For example when you have: > > amazon/tests/system/amazon/.... > apache/cassandra/tests/system/apache/cassandra > And you want to do import system.apache.cassandra and you have amazon's > system in the PYTHONPATH first, you will get no such package system.apache. > > This is because those packages must be declared as "namespace" packages. > > The __path__ = __import__("pkgutil").extend_path(__path__, __name__) is > the canonical way of implementing it (unless you have implicit - or so > called native - namespace packages). > > What we are using now is so called "legacy namespace packages" - this is > described here > https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#legacy-namespace-packages > > The reason why we are using legacy namespace packages insttead of native > ones, is mainly due to compatibiity with airflow package itself. In Airflow > we are already using legacy namespace packages for "airflow.providers" - > https://github.com/apache/airflow/blob/main/airflow/__init__.py#L26 - and > the problem is that you cannot really mix native and legacy packages. > > See the quotes from the documentation of namespace packages: > > These two methods, that were used to create namespace packages prior to > PEP 420, are now considered to be obsolete and should not be used unless > you need compatibility with packages already using this method. Also, > pkg_resources has been deprecated. > > And taking into acount this: > > While native namespace packages and pkgutil-style namespace packages are > largely compatible, pkg_resources-style namespace packages are not > compatible with the other methods. It’s inadvisable to use different > methods in different distributions that provide packages to the same > namespace. > > We are not supposed to mix the two methods. So we don't. But it also means > that we have to make sure that all the packages that are potentially > appearing in several places in our source tree are "legacy" namespace > packages. > > In order to switch fully to native namespace packages (i.e. not having > an__init__.py at all) - we would have to do something that I strongly > advocate for, i.e. stop doing any work while we are importing airlfow. > Currently airflow's __init__.py does a lot of things - initalizes settings, > sqlalchemy, logging and a number of other things. > > And we do a lot of workarounds to minimise side effects of it -> lazy > loading stuff in a number of places, not initializing ORM in other places. > Maybe when we complete Airflow packaging work for Airlfow 3 we will get to > the point that this initialisation and lazy loading will be removed and we > just explicitly do initialization when we run appropriate "airflow" > commands (which is something I advocate for a long time). > > This whole "pkgutil" dance we do now could be removed if we do so - and > all the packages in airlfow and providers could be "native namespace > package" and we could get rid of about 100 __init__.py files (including the > airflow.__init__.py one ) in our repo. We could also stop having the > partially initialized module (most likely due to a circular import) pesky > errors we and a number of our users experience - because having all the > initialization happening in airflow.__init__.py is the root cause for those. > > See > https://packaging.python.org/en/latest/guides/packaging-namespace-packages/ > for details on namespace packages. > > cc: @ashb @uranusjr @kaxil -> something for consideration in our > discussions on how to repackage airlfow soon. I keep on explaining why > running code in airflow.__init__.py is a bad idea and advocating for > removal of it and replace it with explicit initialization, yet that topic > have not been discussed yet, but I will plan to start a discussion about it > soon once we approach the packaging subject. I am not sure what's your > thinking is about this - I know you spent consirderable amount of time on > doing all the "lazy initalization" dance all over the places, and I think > it adds a lot of complexity to our code and only partially solves the > cicular imports problem. But I know @ashb has very strong feeling about > being able to do "from airlfow import Dag" - which more or less requires > all this complexity. I just don't think it's worth it. > > > J. > > > On Sat, Feb 15, 2025 at 10:32 PM Ash Berlin-Taylor <a...@apache.org> wrote: > >> 1) I love the separation and the subproject structure overall. Especially >> with docs in there it should help speed up our docs build on PRs lot by >> not needing to build all docs >> >> 2) no real opinion, seems useful. Let’s see if we can do it with a ruff >> rule first? It _might_ be possible. (It is possible to extend the ruff >> rules on a per subproject basis, check out task_sdk/pyproject.toml) >> >> 3) I don’t love the `tests/provider_tests` (I know why we did it) so how >> about `tests/unit` instead? It makes it clearer what it is (cos everything >> under `tests/` are provider tests so it’s a bit confusing at first to see >> provider tests under a provider folder. >> >> So this would be >> >> PROVIDER >> |- pyproject.toml >> |- provider.yaml >> |- src/ >> |. | airflow/providers/PROVIDER/ >> | |- PROVIDER CODE HERE >> |- docs/ >> |- tests/ >> | | unit/ >> | | | PROVIDER >> | | |- UNIT TEST CODE HERE >> | | integration/ (optional) >> | | | PROVIDER >> | | |- INTEGRATION TEST CODE HERE >> | | system/ (optional) >> | | | PROVIDER >> | | |- SYSTEM TEST CODE HERE >> >> >> One question that wasn’t clear to me (maybe it’s not 100% relevant) is >> which of these folders will and won’t have init.py’s in them? >> >> -ash >> >> > On 15 Feb 2025, at 19:34, Jens Scheffler <j_scheff...@gmx.de.INVALID> >> wrote: >> > >> > Not more to say. Positive to all three points. No better ideas. But time >> > will tell and we might need to revise in future if tooling changes. But >> > this is anyway the normal case. >> > >> > On 15.02.25 20:24, Aritra Basu wrote: >> >> Hi, >> >> Thanks for starting the conversation Jarek, >> >> 1. Given I can't think of a cleaner/better approach I'm for this. I do >> >> think the end goal of adding intellij/vscode config files should >> definitely >> >> be added because many might miss the instructions when they expect the >> ide >> >> to just work. >> >> 2. Very much yes, we definitely should have pre-commits to catch them. >> >> 3. I do prefer these names over the af prefix, though I would love to >> hear >> >> some alternatives >> >> -- >> >> Regards, >> >> Aritra Basu >> >> >> >> On Sat, 15 Feb 2025, 11:45 pm Jarek Potiuk, <ja...@potiuk.com> wrote: >> >> >> >>> Hello everyone, >> >>> >> >>> TL;DR; I wanted to discuss and agree on the final setup for provider's >> >>> tests package internal structure >> >>> >> >>> Now that first stage of cleanup after providers move is merged ( >> >>> https://github.com/apache/airflow/pull/46608 - thanks Jens for a >> thorough >> >>> review of the 608 file+ beast of a change) I would like to explain >> current >> >>> approach for tests setup in providers and some of the consequences, >> and ask >> >>> others what they think. >> >>> >> >>> Currently each of the provider's sub-projects look like that more or >> less >> >>> (this is all inside "providers" folder in airflow): >> >>> >> >>> PROVIDER >> >>> |- pyproject.toml >> >>> |- provider.yaml >> >>> |- src >> >>> |. | airflow.providers.PROVIDER >> >>> | |- PROVIDER CODE HERE >> >>> |- docs >> >>> |- tests >> >>> | | provider_tests >> >>> | | | PROVIDER >> >>> | | |- UNIT TEST CODE HERE >> >>> | | integration (optional) >> >>> | | | PROVIDER >> >>> | | |- INTEGRATION TEST CODE HERE >> >>> | | system (optional) >> >>> | | | PROVIDER >> >>> | | |- SYSTEM TEST CODE HERE >> >>> >> >>> This structure is pretty "standalone" and allows the code in tests of >> >>> providers to import other test provider code without going "out" to >> >>> airflow. I implemented code in "tests_common" that automatically adds >> >>> "tests" for all providers to PYTHONPATH so what you can do now much >> more >> >>> reasonable imports inside the provider test code (i.e. reusing some >> code >> >>> inside tests for each provider): >> >>> >> >>> This is for example what is now >> >>> in >> >>> >> providers/standard/tests/provider_tests/standard/decorators/test_python.py: >> >>> >> >>> *from provider_tests.standard.operators.test_python import >> BasePythonTest* >> >>> >> >>> This is pretty cool, because before the cleanup the imports looked >> like >> >>> this (they were importing starting from the root of the airflow >> project): >> >>> >> >>> *from >> >>> providers.standard.tests.provider_tests.standard.operators.test_python >> >>> import BasePythonTest* >> >>> >> >>> This was bad - because none of the "providers.standard.tests" were >> real >> >>> "python packages" - they were just folders that we used to place the >> >>> "standard provider" and its tests in - we artificially made them >> "python >> >>> packages" by placing `__init__.py` files in them. This is no longer >> there. >> >>> >> >>> This has a certain consequence: if you want to make your IDE to >> understand >> >>> the imports, you have to mark the "tests" folder as "root of test >> sources" >> >>> (at least in IntelliJ). >> >>> >> >>> This is easy to justify and understand (and a very, very reasonable >> >>> assumption) and in a number of cases (for example when you open the >> >>> provider code as a separate project, this will even be auto-detected. >> >>> >> >>> I will add soon (after we finalize this discussion) a bit more >> instructions >> >>> and explanation for various scenarios of opening projects and using >> various >> >>> IDEs in order to develop provider's code to make it clearer - and >> maybe >> >>> later on we could even commit some of the intellij/.vscode project >> >>> configuration (just PYTHONPATH) so that people will not have to do it >> >>> manually. >> >>> >> >>> Eventually (we are not yet fully there but this is coming) - >> contributors >> >>> in the future should be able to just point to the provider's folder >> as a >> >>> project, run `uv sync` there and develop provider without loading and >> >>> opening "airflow" project - run all the tests, build documentation >> etc. >> >>> without even touching the parent airflow folders. >> >>> >> >>> Now.... My questions are: >> >>> >> >>> *1)* do you like this approach? Any comments? >> >>> >> >>> *2)* should we add some pre-commits to verify and check that we only >> use >> >>> the right "from" imports in our tests (there is a possibility that >> someone >> >>> will still use longer names, from content root "from >> providers.standard" ? >> >>> I think we should as it is likely to happen. >> >>> >> >>> *3)* are the top-level package names I chose good ? >> >>> >> >>> I've chosen "provider_tests", "system", "integration" - but maybe >> those are >> >>> not the best names and we should change them ? They are a bit >> inconsistent, >> >>> and also "system" and "integration" , while common, are unlikely to >> be used >> >>> by other packages installed. Previously we could not do such import >> >>> shortening because for example "smtp" provider clashed with "smtp" >> built-in >> >>> package - adding "provider_tests", "System >> >>> >> >>> I had other ideas such as "af_unit", "af_integration", "af_system" >> (they >> >>> should not be very long). But those did not feel right. Somewhat I >> like the >> >>> inconsistency - as integration and system tests are very much >> different >> >>> from the unit tests, even if we still use pytest as runners. >> >>> >> >>> Also - those changes are likely impacting those who run system >> dashboards, >> >>> so you will have to adapt slightly the paths you use in your system >> tests >> >>> :) >> >>> >> >>> Let me know what you think of all that :) >> >>> >> >>> J. >> >>> >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org >> > For additional commands, e-mail: dev-h...@airflow.apache.org >> > >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org >> For additional commands, e-mail: dev-h...@airflow.apache.org >> >>