I would like to raise another discussion here - about potentially fixing
the excessive initialization pattern in `airflow.__init__.py`. It results
from the discussion in
https://github.com/apache/airflow/pull/52952#discussion_r2188492257

This is something we have been seeing for quite some time in Airflow 1 and
2, and we still have some problems with it in Airflow 3. I think that with
the Task Isolation work being completed, we have a chance to straighten it out.

Currently, we do a LOT of stuff on `import airflow` - initializing
configuration, settings, secrets, registering ORM models... you name it.

This is likely - it has never been documented, so I am guessing at the root
cause here - the result of a philosophy that `import airflow` should get you
up and running, with everything you need already "ready for use". This
allows you, for example, to open a Python REPL in an Airflow venv, run
`import airflow`, and have everything you might want to do just work. It
comes from Airflow's highly monolithic architecture, where we had just one
package - and I think we no longer have to hold on to this
assumption/expectation.

The thing is that the whole environment is changing in Airflow 3, and it
will change even further when task isolation is completed. We simply no
longer have a monolithic package structure: we have several distributions
sharing "airflow", and they may or may not be installed together, which
adds a lot of complexity if we rely on `__init__.py` code being executed.

Years ago I proposed making separate "top level" packages (for example
"airflow_providers" for providers), but that proposal was rejected by the
community, and "airflow" became the common "root" package for everything.
As a result, the common "initialization" code is shared - but not really -
because our distributions can sometimes be installed together and sometimes
separately, so we have to handle a lot of complexity and implement hacks to
make this "common" initialization work in all scenarios.

And this leads to a number of complexities and problems that we (and our
users) often experience:

* there are often "module not fully initialized" errors that are difficult
to debug and fix, occurring when we try to import parts of Airflow from
other modules that are still being initialized (logging and secrets
managers are particularly susceptible to this) - we have a lot of "local
imports" and other workarounds to deal with it.

* we have a lot of "lazy loading" implemented - in both production code and
tests - just to handle the conditional nature of some things. For example,
the @provider_configurations_loaded decorator exists specifically to defer
initializing providers until they are actually used. This is not the "best"
pattern, but it is one that works when `__init__` does a lot - it's a
direct result of this heavy initialization. It could be simplified if we
did explicit initialization of things, when needed, in specific CLI commands.

* our "plugins" interface, which used to be "all-in-one", is now pretty
fragmented in terms of what needs to be initialized where. The scheduler
needs "timetable" plugins but needs neither "macros" nor "fast_api_apps",
so it should not initialize them; the webserver, on the other hand, needs
"fast_api_apps", and the worker also needs "global_operator_links" (a
recent change, I think - they used to be rendered in the webserver).

* we have a hard time deciding when we should do certain parts of the
initialization. For example, the plugins manager is currently initialized
effectively on `import airflow`, which means the only way to find out which
CLI command we are running is to look at the interpreter's arguments and
"guess" whether we run as worker or api_server - because after the split we
are not supposed to always initialize all plugins. So the current
implementation in #52952 is ...weird... out of necessity:

```python
# Load the API endpoint only on api-server (Airflow 3.x) or webserver (Airflow 2.x)
if AIRFLOW_V_3_0_PLUS:
    RUNNING_ON_APISERVER = sys.argv[1] in ["api-server"] if len(sys.argv) > 1 else False
else:
    RUNNING_ON_APISERVER = "gunicorn" in sys.argv[0] and "airflow-webserver" in sys.argv
```

*Now, how do we fix it?*

I think the answer is in the Zen of Python: "explicit is better than
implicit". We could simplify a lot of code if we dropped the assumption
that `import airflow` does everything for you. In fact, it should do pretty
much **nothing**. Then, whenever a particular Airflow CLI command is run,
we should explicitly initialize whatever that command needs.

Say:

* airflow api_server -> configuration, settings, database, fast_api_server
and the main "airflow" app
* celery worker -> configuration, settings, task_sdk, fast_api_server with
the "serve_logs" app, "macros" plugins, "global_operator_links"
* scheduler -> configuration, settings, database, "timetable" plugins

etc., etc. - always in the right sequence (this matters a lot; one of the
current sources of problems is that our lazy loading can behave differently
depending on which package you import first), and with minimal lazy loading
- i.e. minimal implicitness.
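The shape of the idea above could look something like the sketch below. All
function names here are hypothetical - this is just to show what "each
entrypoint declares exactly what it needs, in order" would mean, not a
proposal for the actual API:

```python
# Hypothetical explicit initializers - each CLI command composes only what it needs.
INITIALIZED = []


def init_configuration():
    INITIALIZED.append("configuration")


def init_settings():
    INITIALIZED.append("settings")


def init_database():
    INITIALIZED.append("database")


def init_timetable_plugins():
    INITIALIZED.append("timetable_plugins")


def initialize_scheduler():
    # The scheduler entrypoint states its dependencies explicitly, in sequence -
    # no guessing from sys.argv, no import-time side effects.
    for step in (init_configuration, init_settings, init_database, init_timetable_plugins):
        step()
```

With this shape, `import airflow` stays side-effect free, and the
initialization order is visible in one place per command rather than spread
across module-level code.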

I attempted to do this partially in the past (three times, I guess) and
failed miserably because of the intermixing of configuration, settings and
database - but with all the work being done on task isolation, I think a
lot of the roadblocks there are either being handled or have been handled
already.

Also, I don't think this is a "breaking" change. We never actually promised
that `import airflow` does all the initialization. Where this is relied on,
it's mostly in CI/tests etc., and it should be easily remediated by
providing the appropriate initialization calls (in the appropriate
sequence).

I am happy to lead this effort if we agree it's a good direction. It might
also already be planned (explicitly or implicitly) as part of the task
isolation work - so maybe what I am writing about has already been taken
into account (though I have not seen it explicitly addressed) - and I am
happy to help there as well.

I would love to hear your opinions on that.

J.
