ephraimbuddy opened a new issue, #68567:
URL: https://github.com/apache/airflow/issues/68567
### Description
`SerializedDagModel.write_dag`'s "serialized hash unchanged" fast path
refreshes `DagVersion.bundle_version`/`version_data` in place, comparing the
full stored `version_data` against the incoming value:
```python
# airflow-core/src/airflow/models/serialized_dag.py
bundle_metadata_changed = (
    dag_version.bundle_version != bundle_version
    or dag_version.version_data != version_data
)
```
`version_data` is a free-form JSON column (e.g. an S3/custom-bundle
manifest). When it is large, two things get expensive on every parse:
1. `_prefetch_dag_write_metadata` loads the **full** `DagVersion` row —
including the entire `version_data` JSON — for every DAG in the bulk write.
2. The steady-state same-bundle case re-compares the full `version_data`
dict each parse (only skipped when `bundle_version` already differs, thanks to
`or` short-circuiting).
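
To illustrate point 2, here is a minimal standalone sketch of the short-circuit behaviour. It does not use Airflow: `CountingDict` and the free function `bundle_metadata_changed` are hypothetical stand-ins for the stored `version_data` and the check quoted above.

```python
class CountingDict(dict):
    """Hypothetical helper: a dict that counts how often it is compared with !=."""

    comparisons = 0

    def __ne__(self, other):
        CountingDict.comparisons += 1
        return dict.__ne__(self, other)


def bundle_metadata_changed(stored_bundle_version, stored_version_data, bundle_version, version_data):
    # Same shape as the check in write_dag: the dict comparison only runs
    # when the bundle versions are equal.
    return stored_bundle_version != bundle_version or stored_version_data != version_data


stored = CountingDict(files=["a.py", "b.py"])

# Bundle version differs: `or` short-circuits and the dict is never compared.
changed_new_version = bundle_metadata_changed("v1", stored, "v2", {"files": ["a.py", "b.py"]})
comparisons_after_new_version = CountingDict.comparisons

# Steady state (same bundle version): the full dict is compared on every call.
changed_same_version = bundle_metadata_changed("v1", stored, "v1", {"files": ["a.py", "b.py"]})
comparisons_after_same_version = CountingDict.comparisons
```

In the steady state the comparison cost scales with the size of `version_data`, which is the case the proposal targets.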
Proposal: persist a `version_data_hash` (e.g. md5 of the canonical JSON) on
`dag_version` and compare/prefetch that instead of the full blob. The prefetch
then loads only the small hash, and the change check compares hashes.
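
A rough sketch of what the hash and the hash-based check could look like, assuming `version_data` is JSON-serializable and that "canonical JSON" means sorted keys with compact separators. The helper names `compute_version_data_hash` and `bundle_metadata_changed_by_hash` are hypothetical, not existing Airflow functions.

```python
import hashlib
import json


def compute_version_data_hash(version_data):
    """Return a stable hex digest for a JSON-serializable value, or None if unset."""
    if version_data is None:
        return None
    # Sorted keys and fixed separators so equal dicts always serialize identically.
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # md5 is used here only as a cheap change detector, not for security.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def bundle_metadata_changed_by_hash(stored_bundle_version, stored_hash, bundle_version, version_data):
    # Compare the small stored hash instead of the full stored blob.
    return (
        stored_bundle_version != bundle_version
        or stored_hash != compute_version_data_hash(version_data)
    )


stored_hash = compute_version_data_hash({"manifest": ["a.py", "b.py"], "etag": "abc"})
same = bundle_metadata_changed_by_hash("v1", stored_hash, "v1", {"etag": "abc", "manifest": ["a.py", "b.py"]})
different = bundle_metadata_changed_by_hash("v1", stored_hash, "v1", {"etag": "xyz", "manifest": ["a.py", "b.py"]})
```

Note that hashing the incoming value still costs time proportional to its size on each parse. The saving in this sketch comes from not loading or comparing the stored blob.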
### Use case/motivation
Keep DB-side parsing cheap and memory-flat as `version_data` grows (large
manifests from S3/custom bundles). Today the built-in bundles keep
`version_data` small or empty (GitDagBundle doesn't set it), so this is a
forward-looking optimization rather than a current hotspot — surfaced in review
of #68336.
### Related issues
Follow-up from review on #68336 (review comment by @uranusjr). The in-place
refresh logic was introduced there.
### Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
---
Drafted-by: Claude Code (Opus 4.8); reviewed by @ephraimbuddy before posting