ephraimbuddy opened a new issue, #68567:
URL: https://github.com/apache/airflow/issues/68567

   ### Description
   
   `SerializedDagModel.write_dag`'s "serialized hash unchanged" fast path 
refreshes `DagVersion.bundle_version`/`version_data` in place, comparing the 
full stored `version_data` against the incoming value:
   
   ```python
   # airflow-core/src/airflow/models/serialized_dag.py
   bundle_metadata_changed = (
       dag_version.bundle_version != bundle_version
       or dag_version.version_data != version_data
   )
   ```
   
   `version_data` is a free-form JSON column (e.g. an S3/custom-bundle 
manifest). When it is large, two things get expensive on every parse:
   
   1. `_prefetch_dag_write_metadata` loads the **full** `DagVersion` row — 
including the entire `version_data` JSON — for every DAG in the bulk write.
   2. The steady-state same-bundle case re-compares the full `version_data` 
dict each parse (only skipped when `bundle_version` already differs, thanks to 
`or` short-circuiting).
   
   Proposal: persist a `version_data_hash` (e.g. md5 of the canonical JSON) on 
`dag_version` and compare/prefetch that instead of the full blob. The prefetch 
then loads only the small hash, and the change check compares hashes.
   
   ### Use case/motivation
   
   Keep DB-side parsing cheap and memory-flat as `version_data` grows (large 
manifests from S3/custom bundles). Today the built-in bundles keep 
`version_data` small or empty (GitDagBundle doesn't set it), so this is a 
forward-looking optimization rather than a current hotspot — surfaced in review 
of #68336.
   
   ### Related issues
   
   Follow-up from review on #68336 (review comment by @uranusjr). The in-place 
refresh logic was introduced there.
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   
   ---
   Drafted-by: Claude Code (Opus 4.8); reviewed by @ephraimbuddy before posting

