dwreeves opened a new issue, #67515:
URL: https://github.com/apache/airflow/issues/67515

   ### Related issues
   
   # TLDR
   
   - There are a lot of “easy” performance wins via lazy loading across many 
provider packages, many of which are fairly popular.
   - Lazy loading makes sense as an optimization because, over a DAG’s 
lifecycle, the 3rd party packages are actually executed far less often than 
the DAG code is parsed.
   - The biggest wins will be in the following provider packages: `google`, 
`alibaba`, `teradata`, `papermill`, `neo4j`, `databricks`, `samba`, 
`elasticsearch`.
   
   # Background and Motivation
   
   ## Why?
   
   Let’s say a user uses 3 different providers in a single DAG. However, only 
one operator is executed at a time. If each of the 3 provider imports brings 
a bunch of superfluous stuff into `sys.modules` that is never actually used, 
the result is unnecessary overhead during DAG parsing and during the execution 
of every other task.
   
   On the margin, lazy loading improves performance by reducing the memory 
footprint and speeding up start-ups during the scheduler’s DAG file processing 
and during task execution.
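   
   As a minimal sketch of the pattern being proposed (not code from any actual provider), the hypothetical hook below defers its dependency import until the method that needs it runs. The stdlib `colorsys` module stands in for a heavy 3rd party package, and `ExampleHook` and `to_hsv` are made-up names:

```python
import sys


class ExampleHook:
    """Hypothetical hook; `colorsys` stands in for a heavy third-party SDK."""

    def to_hsv(self, r: float, g: float, b: float) -> tuple[float, float, float]:
        # Deferred import: the dependency is loaded on the first call, not when
        # the DAG file (and therefore this module) is parsed.
        import colorsys

        return colorsys.rgb_to_hsv(r, g, b)


hook = ExampleHook()  # constructing the hook does not trigger the import
hsv = hook.to_hsv(1.0, 0.0, 0.0)  # the first call pays the import cost once
loaded_after_call = "colorsys" in sys.modules
```

   Later calls reuse the entry already cached in `sys.modules`, so the cost is paid at most once per process, and only in processes that actually call the method.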
   
   These changes are also fairly easy and low risk to implement. Linters tend 
to catch undefined-name errors when an import is removed from module globals 
but then not properly lazy loaded. The most annoying part of implementing this 
is fixing unit test mocks.
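   
   To illustrate why mocks need fixing, here is a small sketch under assumed names: once an import moves inside a function, the library is no longer a global of the module under test, so a patch aimed at that module global has nothing to replace. One workaround is to patch the attribute on the library itself. The stdlib `json` module stands in for a heavy client library, and `serialize_payload` is hypothetical:

```python
from unittest import mock


def serialize_payload(payload: dict) -> str:
    # Lazy import: `json` stands in for a heavy third-party client library.
    import json

    return json.dumps(payload)


def test_serialize_payload_uses_library() -> None:
    # Patching "<this module>.json" would fail because `json` is no longer a
    # module-level global here, so the patch targets the library itself.
    with mock.patch("json.dumps", return_value="mocked") as fake_dumps:
        assert serialize_payload({"a": 1}) == "mocked"
    fake_dumps.assert_called_once_with({"a": 1})


test_serialize_payload_uses_library()
```

   This works because the function looks up `json.dumps` at call time, after the patch is already in place.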
   
   ## PRs done so far
   
   My last couple of PRs to Airflow were aimed at lazy-loading 3rd party 
packages in provider packages:
   
   - #67479
   - #62365
   
   For selfish reasons, I’ve specifically targeted provider packages that I 
personally use (Snowflake and Slack) and with which I have experienced 
occasional OOM issues on small workers.
   
   # Research and Methodology
   
   I decided to modify (read: have Claude modify) the script I was using to 
benchmark the performance gains in lazy-loading individual packages (Slack and 
Snowflake) to run on _all_ provider packages, and find areas for performance 
improvement across Airflow.
   
   The simple version: I am looking at the “delta” (in wall-clock time and 
memory) between a baseline that loads only what is necessary for task 
execution plus BaseHook and BaseOperator, and the same baseline after 
importing all the modules in the provider package. I then sort the packages by 
their deltas to identify the worst offenders, and I also look at the 
individual packages contributing to each delta.
   
   Measuring deltas for packages in isolation is conceptually imperfect: 
package A can appear slow only because it imports B, but if B has to be 
imported globally anyway, then A isn’t contributing much overhead. Still, this 
is good enough for getting a sense of where problems may lie.
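   
   The delta idea can be sketched roughly as follows. This is not the script from the gist: `measure_import` is a hypothetical helper, `wave` is an arbitrary stdlib stand-in for a provider module, and `tracemalloc` only counts memory allocated by Python, which is an assumption that will differ from process-level (RSS) measurements:

```python
import importlib
import sys
import time
import tracemalloc


def measure_import(module_name: str) -> dict:
    """Return the time, traced memory, and new modules added by one import."""
    modules_before = set(sys.modules)
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    memory_before, _ = tracemalloc.get_traced_memory()
    started = time.perf_counter()

    importlib.import_module(module_name)

    elapsed = time.perf_counter() - started
    memory_after, _ = tracemalloc.get_traced_memory()
    if not already_tracing:
        tracemalloc.stop()
    return {
        "seconds": elapsed,
        "bytes": memory_after - memory_before,
        "new_modules": sorted(set(sys.modules) - modules_before),
    }


# Use a fresh interpreter per target so earlier imports do not hide the cost.
first = measure_import("wave")
# A repeat import is served from sys.modules, so it adds no new modules.
second = measure_import("wave")
```

   The second measurement shows the isolation caveat in miniature: anything already in `sys.modules` is effectively free, so the attributed cost depends on what was imported first.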
   
   A full markdown report is here: 
https://gist.github.com/dwreeves/d3c35354c2305b9a81d0d67a0280830a#file-airflow_optimization_report-md
 The script is at the bottom of the gist, and you can run it to generate the 
report or to just investigate individual packages:
   
   <img width="1200" height="753" alt="Image" 
src="https://github.com/user-attachments/assets/cf5fcaaa-c844-4985-a4b0-09b339a6a555"
 />
   
   # Learnings
   
   The biggest wins will be in the following provider packages: `google`, 
`alibaba`, `teradata`, `papermill`, `neo4j`, `databricks`, `samba`, 
`elasticsearch`.
   
   pandas tends to be a common major contributor to load delta, as does each 
provider’s namesake package. On my M4 MacBook, pandas adds 150ms of load time 
and 65.3MB of additional memory.
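   
   For pandas specifically, one common way to keep type annotations without paying the import cost at parse time is a `TYPE_CHECKING` guard plus an import inside the function. The sketch below uses a hypothetical helper, `rows_to_dataframe`; defining it does not import pandas, while calling it assumes pandas is installed:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers and linters; never executed at runtime.
    import pandas as pd


def rows_to_dataframe(rows: list[dict]) -> "pd.DataFrame":
    """Hypothetical helper: build a DataFrame from a list of row dicts."""
    # pandas is imported only when a task actually needs a DataFrame.
    import pandas as pd

    return pd.DataFrame(rows)


# The annotation stays a plain string, so no pandas import happens here.
return_annotation = rows_to_dataframe.__annotations__["return"]
```

   Quoting the return annotation keeps it as a string at runtime, which is what lets the module be imported without pandas being loaded.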
   
   # How to handle this?
   
   This is the tricky part. The protocol is for contributors to test their own 
changes to provider packages during the alpha release. However, most of the 
packages identified in this analysis are ones I don’t actually use. So 
although I can write the code to modify the provider packages, I would not be 
able to test the changes.
   
   I do believe there are a lot of changes that are very straightforward that I 
could take on based on a reasonable trade-off between complexity, performance 
gain, and provider package popularity.
   
   It's unclear if this should be one big PR or individual PRs per provider 
package.
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
