> This might be related, or it might not be, but I think I would also love it if we moved all of “core” (scheduler, jobs, api server etc) to airflow_core.* Python modules, and out of `airflow.*` entirely. (Meaning `airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus some compat shims, possibly installed by apache-airflow-task-sdk itself). Were you thinking something similar?
I do not have exact details yet; it's more about "changing the philosophy of initialisation". I think it would need a POC to work out the details. Unfortunately, such a POC will require quite an investment, and once done it would be almost complete, because so many things are intertwined in our initialization that you only find out what is affected after you move things :). That's my experience from previous attempts. Usually it started with "hey, I can move this and that here and we will be good", but after doing it, it turned out that other parts also had to be touched, and it caused an avalanche of changes ripping through almost the whole codebase (to the point that I gave up).

But yes, that might be one of the ways to achieve that. I am all for trying it and seeing how it might work out.

J

On Mon, Jul 7, 2025 at 2:49 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> Yeah, this has been a long-time bugbear of mine and I would love to remove the magic and the side-effects of `import airflow`.
>
> Do you have any plans or thoughts about how to actually achieve this?
>
> This might be related, or it might not be, but I think I would also love it if we moved all of “core” (scheduler, jobs, api server etc) to airflow_core.* Python modules, and out of `airflow.*` entirely. (Meaning `airflow` would be left for just `airflow.sdk` and `airflow.providers`, plus some compat shims, possibly installed by apache-airflow-task-sdk itself). Were you thinking something similar?
>
> -ash
>
> > On 7 Jul 2025, at 12:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > I would like to raise another discussion here - about potentially fixing the excessive initialization pattern in `airflow.__init__.py`. It results from the discussion at https://github.com/apache/airflow/pull/52952#discussion_r2188492257
> >
> > This is something we have been seeing for quite some time in Airflow 1 and 2, and we still have some problems with it in Airflow 3. I think that with the completion of the Task Isolation work, we have a chance to straighten it out.
> >
> > Currently, we do a LOT of stuff when we do `import airflow` - initializing configuration, settings, secrets, registering ORM models... you name it.
> >
> > This is likely (it has never been documented, so I am guessing the root cause now) the result of the philosophy that "import airflow" should get you up and running, with everything needed already "ready for use". This allows you, for example, to open a Python REPL in the airflow venv, do "import airflow", and then do everything you would like to do. It comes from the highly monolithic architecture of Airflow, where we had just one package. And I think we do not have to hold to this assumption/expectation.
> >
> > The thing is that the whole environment is changing in Airflow 3, and it will change even further when task isolation is completed. We simply do not have a monolithic structure of packages: we have several distributions sharing "airflow", and they might or might not be installed together, which adds a lot of complexity if we rely on "__init__.py" code being executed.
> >
> > While I proposed in the past (years ago) to make separate "top level" packages (for example "airflow_providers" for providers), this proposal was rejected by the community and "airflow" became the common "root" package for everything. At the same time, this means that the common "initialization" code is shared - but not really, because sometimes our distributions can be installed together and sometimes separately - and we need to handle a lot of complexity and implement some hacks to make this "common" initialization work in all scenarios.
> >
> > And it leads to a number of complexities and problems we (and our users) often experience:
> >
> > * there are often "module not fully initialized" errors that are difficult to debug and fix when we try to import parts of airflow from other modules that are "being initialized" (logging and secrets managers are particularly susceptible to that) - we have a lot of "local imports" and other ways to deal with it.
> >
> > * we have a lot of "lazy-loading" implemented - in both production code and tests - just to handle the conditional nature of some things. For example, the @provider_configurations_loaded decorator is implemented specifically to defer initializing providers until they are going to be used. This is not the "best" pattern, but one that works in the circumstances of init doing a lot - and it's a direct result of us doing this heavy initialisation. It could be simplified if we did explicit initialization of things when needed in specific CLI commands.
> >
> > * our "plugins" interface that used to be "all-in-one" is now pretty fragmented across what needs to be initialized where. While the Scheduler needs "timetable" plugins, it does not need "macros" nor "fast_api_apps" and it should not initialize them, but "webserver" on the other hand needs "fast_api_apps" and the worker also needs "global_operator_links" (this is a recent change I think - they used to be rendered in the web server).
> >
> > * we have a hard time deciding when we should do certain parts of initialization - for example, currently the plugins manager is effectively initialized in "import airflow" - and it means that the only way to find out which "cli" command we run is to look at the arguments of the interpreter, so that we can "guess" whether we are run as worker or api_server. After the split, we are not supposed to always initialize all plugins, so the current implementation in #52952 is ....weird.... out of necessity:
> >
> >     # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
> >     if AIRFLOW_V_3_0_PLUS:
> >         RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
> >     else:
> >         RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
> >
> > *Now, how to fix it?*
> >
> > I think the answer is in the Zen of Python: "explicit is better than implicit". I think we could simplify a lot of code if we drop the assumption that "import airflow" does everything for you. In fact it should do pretty much **nothing**. Then whenever a particular CLI command of airflow is run, we should explicitly initialize whatever we need.
> >
> > Say:
> >
> > * airflow api_server -> configuration, settings, database, fast_api_server and main "airflow" app
> > * celery worker -> configuration, settings, task_sdk, fast_api_server and "serve_logs" app, "macro plugins", "global_operator_links"
> > * scheduler -> configuration, settings, database, timetable plugins
> >
> > etc. etc. Always in the right sequence (this matters a lot, and it is currently one of the sources of problems: depending on which package you import first, our lazy loading might work differently), with minimal lazy loading - i.e. minimal implicitness.
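To make the "explicit initialization per CLI command" idea above more concrete, here is a minimal sketch. It is not Airflow code: the step names, `INIT_LOG`, `COMMAND_INIT` and `initialize_for` are all hypothetical, and each step only records that it ran where a real one would load configuration, set up the database, and so on. It only illustrates each command declaring an ordered list of steps instead of relying on import side effects.

```python
from typing import Callable

# Hypothetical record of which steps ran, in order (stands in for real setup work).
INIT_LOG: list[str] = []


def _make_step(name: str) -> Callable[[], None]:
    """Build a placeholder init step that only records that it was executed."""
    def step() -> None:
        INIT_LOG.append(name)
    return step


# Hypothetical step names, loosely following the lists in the message above.
STEP_NAMES = (
    "configuration", "settings", "database", "task_sdk",
    "fast_api_server", "airflow_app", "serve_logs_app",
    "timetable_plugins", "macro_plugins", "global_operator_links",
)
STEPS: dict[str, Callable[[], None]] = {n: _make_step(n) for n in STEP_NAMES}

# Each command names exactly what it needs, in the order it needs it.
COMMAND_INIT: dict[str, tuple[str, ...]] = {
    "api-server": ("configuration", "settings", "database",
                   "fast_api_server", "airflow_app"),
    "celery-worker": ("configuration", "settings", "task_sdk",
                      "fast_api_server", "serve_logs_app",
                      "macro_plugins", "global_operator_links"),
    "scheduler": ("configuration", "settings", "database",
                  "timetable_plugins"),
}


def initialize_for(command: str) -> list[str]:
    """Run the init steps declared for `command`, in order; return their names."""
    if command not in COMMAND_INIT:
        raise ValueError(f"no initialization sequence defined for {command!r}")
    executed = []
    for name in COMMAND_INIT[command]:
        STEPS[name]()  # explicit call: nothing happens at import time
        executed.append(name)
    return executed
```

With a table like this, a CLI entry point would call something like `initialize_for("scheduler")` itself, so no code would need to inspect `sys.argv` to guess which component is running, and the order of steps would not depend on which module happened to be imported first.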
> >
> > I attempted to do it partially in the past (I guess 3 times) and failed miserably because of the intermixing of configuration, settings and database - but with a lot of work being done on task isolation, I think a lot of the roadblocks there are either being handled or have been handled already.
> >
> > Also, I think it's not a "breaking" change. We never actually promised that "import airflow" does all the initialization. If this is relied on, it's mostly in CI/tests etc. and should be easily remediated by providing appropriate initialization calls (and an appropriate sequence of those initializations).
> >
> > I am happy to lead that effort if we agree this is a good direction. It might also already be kind of planned (explicitly or implicitly) as part of the task isolation work - so maybe what I am writing about has already been taken into account (but I have not seen it explicitly addressed), and I am happy to help there as well.
> >
> > I would love to hear your opinions on that.
> >
> > J.