One option would be an approach similar to Terraform's, i.e. dedicated "metadata" state storage in a DIFFERENT S3 bucket than the DAG files. Since we know S3 must be available anyway (obviously), it does not seem too excessive to assume there could be another bucket, independent of the DAG bucket, where the state is stored. The same bucket (and a dedicated connection id) could even hold state for multiple S3 DAG bundles, with each bundle getting a dedicated object storing its state. The metadata is not huge, so continuously reading and replacing it should not be an issue.
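To make the idea concrete, here is a minimal sketch of what the per-bundle state layout could look like. The bucket name, the `bundle-state/` prefix, and the helper names are all hypothetical, not anything that exists in the provider today; the boto3 calls are shown only in comments.

```python
import json

# Hypothetical layout: one small state object per DAG bundle in a
# dedicated metadata bucket, e.g.
#   s3://my-metadata-bucket/bundle-state/<bundle-name>.json
STATE_PREFIX = "bundle-state/"


def state_key(bundle_name: str) -> str:
    """Object key holding the versioning state for one DAG bundle."""
    return f"{STATE_PREFIX}{bundle_name}.json"


def dump_state(state: dict) -> bytes:
    """Serialize the (small) metadata state for put_object."""
    return json.dumps(state, sort_keys=True).encode("utf-8")


def load_state(body: bytes) -> dict:
    """Deserialize state read back with get_object."""
    return json.loads(body.decode("utf-8"))


# With a boto3 S3 client (not created here), the round trip would be:
#   s3.put_object(Bucket=meta_bucket, Key=state_key(name),
#                 Body=dump_state(state))
#   obj = s3.get_object(Bucket=meta_bucket, Key=state_key(name))
#   state = load_state(obj["Body"].read())
```

Because each bundle writes to its own key, multiple bundles can share the one metadata bucket without stepping on each other.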
What's nice about it: this single object could even **actually** use S3 versioning to keep historical state, both to optimize things and potentially to keep a log of changes.

J.

On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:

> Hey folks,
>
> tl;dr I'd like to get some thoughts on a proposal to use DB Manager for
> S3 Dag Bundle versioning.
>
> The initial commit for S3 Dag Bundles was recently merged [1], but it
> lacks Bundle versioning (since this isn't trivial with something like
> S3, the way it is with Git). The proposed solution involves building a
> snapshot of the S3 bucket at the time each Bundle version is created:
> noting the version of all the objects in the bucket (using S3's native
> bucket versioning feature), creating a manifest to store those
> versions, and then giving that whole manifest itself a unique
> id/version/uuid. These manifests now need to be stored somewhere for
> future use/retrieval. The proposal is to use the Airflow database via
> the DB Manager feature. Other options include using the local
> filesystem to store them (but this obviously won't work in Airflow's
> distributed architecture) or the S3 bucket itself (but this requires
> write access to the bucket, and we would always be at the mercy of the
> user accidentally deleting/modifying the manifests as they manage the
> lifecycle of their bucket; they should not need to be aware of, or
> account for, this metadata). So the Airflow DB works nicely as a
> persistent and internally accessible location for this data.
>
> But I'm aware of the complexities of using the DB Manager and the
> discussion we had during the last dev call about providers vending DB
> tables (concerning migrations and ensuring smooth upgrades or
> downgrades of the schema). So I wanted to reach out to see what folks
> thought. I have talked to Jed, the Bundle Master (tm), and we haven't
> come up with anything else that solves the problem as cleanly, so the
> DB Manager is still my top choice. I think whatever we go with will
> pave the way for other Bundle providers of a similar type as well, so
> it's worth thinking deeply about this decision.
>
> Let me know what you think, and thanks for your time!
>
> Cheers,
> Niko
>
> [1] https://github.com/apache/airflow/pull/46621
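For reference, the manifest-building step described in the quoted proposal could be sketched roughly as below. This is not the actual implementation from the provider; the function name and manifest shape are assumptions, and the input mimics the "Versions" records that boto3's `list_object_versions` paginator returns.

```python
import uuid


def build_manifest(object_versions: list[dict]) -> dict:
    """Snapshot the current state of a versioned S3 bucket.

    `object_versions` mimics the "Versions" entries from boto3's
    list_object_versions paginator: each record carries "Key",
    "VersionId", and "IsLatest". Only the latest version of each key
    is recorded, and the manifest itself gets a unique id.
    """
    snapshot = {
        v["Key"]: v["VersionId"] for v in object_versions if v["IsLatest"]
    }
    return {"manifest_id": str(uuid.uuid4()), "objects": snapshot}


# With a boto3 S3 client, the records would come from something like:
#   paginator = s3.get_paginator("list_object_versions")
#   versions = []
#   for page in paginator.paginate(Bucket=dag_bucket):
#       versions.extend(page.get("Versions", []))
#   manifest = build_manifest(versions)
```

Restoring a given bundle version would then mean fetching each key with `get_object(..., VersionId=manifest["objects"][key])`, which is why the manifests need durable storage somewhere (the DB Manager proposal above, or a dedicated metadata bucket).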