[I] Provider package performance improvements via lazy loading 3rd party packages [airflow]

via GitHub Mon, 25 May 2026 17:07:55 -0700


dwreeves opened a new issue, #67515:
URL: https://github.com/apache/airflow/issues/67515

### Description

_No response_

### Use case/motivation

_No response_

### Related issues

# TLDR

- There are a lot of “easy” performance wins via lazy loading across many
provider packages, many of which are fairly popular.
- Lazy loading makes sense as an optimization because, over a DAG file’s
lifecycle, the 3rd party packages are actually executed far less often than
the DAG code is parsed.
- The biggest wins will be in the following provider packages: `google`,
`alibaba`, `teradata`, `papermill`, `neo4j`, `databricks`, `samba`,
`elasticsearch`.

# Background and Motivation

## Why?

Let’s say a user uses 3 different providers in a single DAG. But only one
operator executes at a time. If each of the 3 provider imports brings a bunch
of superfluous modules into `sys.modules` that are never actually used, that
is an unnecessary performance cost during DAG parsing and during the
execution of every task that does not need them.

On the margin, lazy loading improves performance by reducing the memory
footprint and speeding up start-ups during the scheduler’s DAG file processing
and during task execution.

These changes are also fairly easy and low risk to implement. Linters tend to
catch names that were removed from a module’s globals but never lazily
imported where they are used. The most annoying part of implementing this is
fixing unit test mocks.
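
As a minimal sketch of the pattern being proposed (not code from any actual provider), the module-level import moves into the method that needs it, and a `TYPE_CHECKING` import keeps the name visible to linters and type checkers. `ExampleHook` is a hypothetical class, and the stdlib `sqlite3` module stands in for a heavy 3rd party client so the example runs anywhere:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers and linters; never executed at runtime,
    # so parsing a DAG file that imports this module does not load sqlite3.
    import sqlite3


class ExampleHook:
    """Hypothetical hook; sqlite3 is a stand-in for a heavy 3rd party package."""

    def __init__(self, database: str = ":memory:") -> None:
        self.database = database

    def get_conn(self) -> "sqlite3.Connection":
        # Deferred import: the cost is paid the first time a task actually
        # needs a connection, not every time the module is imported.
        import sqlite3

        return sqlite3.connect(self.database)
```

One consequence for tests: once the import lives inside the method, the library name is no longer an attribute of the hook’s module, so a `mock.patch` that targeted it there has to patch the library itself instead. That is probably where most of the mock churn comes from.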

## PRs done so far

A couple of my most recent PRs to Airflow were aimed at lazy-loading 3rd
party packages in provider packages:

- #67479
- #62365

For selfish reasons, I’ve specifically targeted provider packages that I
personally use (Snowflake and Slack) and with which I have experienced
occasional OOM issues on small workers.

# Research and Methodology

I decided to modify (read: have Claude modify) the script I was using to
benchmark the performance gains in lazy-loading individual packages (Slack and
Snowflake) to run on _all_ provider packages, and find areas for performance
improvement across Airflow.

The simple version: I am looking at the “delta” (in wall-clock time and
memory) between a baseline that loads what is necessary for task execution
plus BaseHook and BaseOperator, and the state after importing all the modules
in the provider package. I then sort the provider packages by their deltas to
identify the worst offenders, and I also look at the individual 3rd party
packages contributing to each delta.

Measuring deltas for packages in isolation is conceptually imperfect: package
A may look slow only because it imports B, and if B has to be imported
globally anyway, then A isn’t contributing much overhead of its own. Still,
this is good enough for getting a sense of where problems may lie.
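
To illustrate the idea of an import delta (this is a simplified sketch, not the script from the gist), the probe below runs in a fresh interpreter, optionally imports a baseline module first, and then reports the time, traced memory, and newly loaded modules attributable to the target import. The function name `import_delta` and the choice of `tracemalloc` are assumptions of this sketch; `tracemalloc` only sees allocations made through Python’s allocator, so it will undercount packages with large native extensions.

```python
import json
import subprocess
import sys

# Code executed in a clean child interpreter so that modules already loaded
# in the current process do not hide the cost of the import being measured.
_PROBE = """
import importlib, json, sys, time, tracemalloc

baseline, target = sys.argv[1], sys.argv[2]
if baseline:
    importlib.import_module(baseline)  # cost we do not want to attribute

before = set(sys.modules)
tracemalloc.start()
start = time.perf_counter()
importlib.import_module(target)
elapsed = time.perf_counter() - start
current, _peak = tracemalloc.get_traced_memory()

print(json.dumps({
    "seconds": elapsed,
    "bytes": current,
    "new_modules": sorted(set(sys.modules) - before),
}))
"""


def import_delta(target: str, baseline: str = "") -> dict:
    """Return time, traced memory, and new modules for importing `target`."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, baseline, target],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last line in case the imported module prints during import.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

Calling `import_delta("pandas", baseline="airflow.models.baseoperator")` would be one way to approximate the comparison described above, assuming both are installed. The `new_modules` list shows which modules the target pulled into `sys.modules`.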

A full markdown report is here:
https://gist.github.com/dwreeves/d3c35354c2305b9a81d0d67a0280830a#file-airflow_optimization_report-md

The script is at the bottom of the gist, and you can run it to generate the
report or to just investigate individual packages.

# Learnings

The biggest wins will be in the following provider packages: `google`,
`alibaba`, `teradata`, `papermill`, `neo4j`, `databricks`, `samba`,
`elasticsearch`.

Pandas tends to be a common major contributor to load delta, as do the
namesake packages (the 3rd party library a provider is named after). On my M4
MacBook, Pandas adds 150ms of load time and 65.3MB of additional memory.

# How to handle this?

This is the tricky part. It is protocol for contributors to test their own
changes to provider packages during the alpha release. However, most of the
packages identified in this analysis are ones I don’t actually use. So
although I can write the code to modify those provider packages, I would not
be able to test the changes myself.

I do believe there are a lot of changes that are very straightforward that I
could take on based on a reasonable trade-off between complexity, performance
gain, and provider package popularity.

It's unclear if this should be one big PR or individual PRs per provider
package.

### Are you willing to submit a PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
