I would like to raise another discussion here - about potentially fixing the excessive initialization pattern in `airflow/__init__.py`. It results from the discussion at https://github.com/apache/airflow/pull/52952#discussion_r2188492257.
This is something we have been seeing for quite some time in Airflow 1 and 2, and we still have some problems with it in Airflow 3. I think that with completing the Task Isolation work, we have a chance to straighten it out.

Currently, we just do a LOT of stuff when we do `import airflow` - initializing configuration, settings, secrets, registering ORM models... you name it. This is likely (it has never been documented, so I am guessing at the root cause) the result of the philosophy that "import airflow" should get you up and running, with everything needed already "ready for use". This allows you, for example, to open a Python REPL in the Airflow venv, do "import airflow", and then everything you would like to do should be possible. It comes from the highly monolithic architecture of Airflow, where we had just one package.

I think we do not have to hold to this assumption/expectation. The whole environment is changing in Airflow 3, and it will change even further when task isolation is completed. We simply do not have a monolithic structure of packages anymore: we have several distributions sharing the "airflow" package, and they might or might not be installed together, which adds a lot of complexity if we rely on "__init__.py" code being executed. Years ago I proposed making separate "top level" packages (for example "airflow_providers" for providers), but this proposal was rejected by the community and "airflow" became the common "root" package for everything. As a result, the common "initialization" code is shared - but not really - because our distributions can sometimes be installed together and sometimes separately, and we need to handle a lot of complexity and implement some hacks to make this "common" initialization work in all scenarios.
And it leads to a number of complexities and problems we (and our users) often experience:

* there are often "module not fully initialized" errors that are difficult to debug and fix when we try to import parts of airflow from other modules that are "being initialized" (logging and secrets managers are particularly susceptible to that) - we have a lot of "local imports" and other ways to deal with it.
* we have a lot of "lazy-loading" implemented - in both production code and tests - just to handle the conditional nature of some things. For example, the `@provider_configurations_loaded` decorator is implemented specifically to defer initializing providers until they are going to be used. This is not the "best" pattern, but one that works in the circumstances of init doing a lot - and it's a direct result of us doing this heavy initialization. It could be simplified if we did explicit initialization of things when needed in specific CLI commands.
* our "plugins" interface that used to be "all-in-one" is now pretty fragmented across what needs to be initialized where. While the scheduler needs "timetable" plugins, it needs neither "macros" nor "fast_api_apps" and should not initialize them; the "webserver", on the other hand, needs "fast_api_apps", and the worker also needs "global_operator_links" (this is a recent change I think - they used to be rendered in the web server).
* we have a hard time deciding when we should do certain parts of initialization. For example, currently the plugins manager is effectively initialized in "import airflow", which means that the only way to find out which "cli" command we are running is to look at the arguments of the interpreter, so that we can "guess" whether we are run as worker or api_server - because after the split, we are not supposed to always initialize all plugins - so the current implementation in #52952 is ....weird....
out of necessity:

    # Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
    if AIRFLOW_V_3_0_PLUS:
        RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
    else:
        RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv

*Now, how to fix it?*

I think the answer is in the Zen of Python: "explicit is better than implicit". I think we could simplify a lot of code if we drop the assumption that "import airflow" does everything for you. In fact it should do pretty much **nothing**. Then whenever a particular airflow CLI command is run, we should explicitly initialize whatever we need. Say:

* airflow api_server -> configuration, settings, database, fast_api_server and main "airflow" app
* celery worker -> configuration, settings, task_sdk, fast_api_server and "serve_logs" app, "macro plugins", "global_operator_links"
* scheduler -> configuration, settings, database, timetable plugins

etc. etc.

Always in the right sequence (this matters a lot, and it is currently one of the sources of problems: depending on which package you import first, our lazy loading might work differently), and with minimal lazy loading - i.e. minimal implicitness.

I attempted to do it partially in the past (I guess 3 times) and failed miserably because of the intermixing of configuration, settings and database - but with a lot of work being done on task isolation, I think a lot of the roadblocks there are either being handled or have been handled already.

Also, I think it's not a "breaking" change. We never actually promised that "import airflow" does all the initialization. If this is relied on, it's mostly in CI/tests etc., and it should be easily remediated by providing appropriate initialization calls (and the appropriate sequence of those initializations).

I am happy to lead that effort if we agree this is a good direction.
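To make the idea a bit more concrete, here is a minimal sketch of what explicit, per-command initialization in a fixed sequence could look like. Everything in it is an assumption for illustration only: the step names are simply taken from the lists above, and `INIT_ORDER`, `COMMAND_PROFILES`, `STEPS` and `initialize_for` are hypothetical names, not existing Airflow APIs. The steps are placeholders that only record that they ran.

```python
from collections.abc import Callable

# Hypothetical global ordering of initialization steps. The names come from
# the lists in this message; none of them are real Airflow functions.
INIT_ORDER: tuple[str, ...] = (
    "configuration",
    "settings",
    "database",
    "task_sdk",
    "timetable_plugins",
    "macro_plugins",
    "global_operator_links",
    "fast_api_server",
)

# Hypothetical mapping: which steps each CLI command asks for explicitly.
COMMAND_PROFILES: dict[str, frozenset[str]] = {
    "api-server": frozenset(
        {"configuration", "settings", "database", "fast_api_server"}
    ),
    "celery-worker": frozenset(
        {
            "configuration",
            "settings",
            "task_sdk",
            "macro_plugins",
            "global_operator_links",
            "fast_api_server",
        }
    ),
    "scheduler": frozenset(
        {"configuration", "settings", "database", "timetable_plugins"}
    ),
}

# Log of executed steps, so the effect of the placeholders is observable.
executed: list[str] = []


def _make_step(name: str) -> Callable[[], None]:
    """Build a placeholder step; a real one would do the actual setup."""

    def step() -> None:
        executed.append(name)

    return step


STEPS: dict[str, Callable[[], None]] = {name: _make_step(name) for name in INIT_ORDER}


def initialize_for(command: str) -> list[str]:
    """Run only the steps the command needs, always in INIT_ORDER."""
    needed = COMMAND_PROFILES[command]
    ran: list[str] = []
    for name in INIT_ORDER:
        if name in needed:
            STEPS[name]()
            ran.append(name)
    return ran
```

With something like this, the entrypoint of each CLI command would call `initialize_for("scheduler")` (or its own profile) as its first action, and the sequence would no longer depend on which module happened to be imported first.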
It might already also be kind of planned (explicitly or implicitly) as part of the task isolation work - so maybe what I am writing about has already been taken into account (but I have not seen it explicitly addressed), and I am happy to help there as well.

I would love to hear your opinions on that.

J.