Agree, another S3 bucket also works here. Another option would be using a
DynamoDB table? That also supports snapshots, and I feel it works very well
with state management.
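For illustration, here is a minimal sketch of what a per-bundle state item in a DynamoDB table could look like. Everything here is hypothetical (the table name, attribute names, and helper are made up for this thread, not anything Airflow ships); the point is that a monotonically increasing `seq` attribute gives you optimistic locking via a conditional write, so two components can't silently clobber each other's state.

```python
import json
import time
import uuid

# Hypothetical table name - nothing in Airflow uses this today.
STATE_TABLE = "airflow_bundle_state"


def build_state_item(bundle_name: str, manifest: dict, prev_seq: int) -> dict:
    """Build a DynamoDB-style item recording the latest manifest for a bundle.

    `seq` is a monotonically increasing counter used for optimistic locking:
    the accompanying write would use a condition such as
    `attribute_not_exists(seq) OR seq = :prev` so that concurrent writers
    fail loudly instead of overwriting each other.
    """
    return {
        "bundle_name": {"S": bundle_name},        # partition key
        "seq": {"N": str(prev_seq + 1)},          # optimistic-lock counter
        "manifest_id": {"S": str(uuid.uuid4())},  # id for this snapshot
        "created_at": {"N": str(int(time.time()))},
        "manifest": {"S": json.dumps(manifest, sort_keys=True)},
    }


item = build_state_item(
    "my-s3-bundle",
    {"dags/a.py": "v1", "dags/b.py": "v3"},
    prev_seq=4,
)
```

With boto3 this item would then go through something like `client.put_item(TableName=STATE_TABLE, Item=item, ConditionExpression="seq = :prev", ...)`, with historical rows (or DynamoDB's point-in-time recovery) serving as the snapshot/history Pavan mentions.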
Pavan

On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> One of the options would be to use a similar approach as terraform uses -
> i.e. use dedicated "metadata" state storage in a DIFFERENT s3 bucket than
> DAG files. Since we know there must be an S3 available (obviously) - it
> seems not too excessive to assume that there might be another bucket,
> independent of the DAG bucket where the state is stored - same bucket (and
> dedicated connection id) could even be used to store state for multiple S3
> dag bundles - each Dag bundle could have a dedicated object storing the
> state. The metadata is not huge, so continuously reading and replacing it
> should not be an issue.
>
> What's nice about it - this single object could even *actually* use S3
> versioning to keep historical state - to optimize things and keep a log of
> changes potentially.
>
> J.
>
> On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid>
> wrote:
>
> > Hey folks,
> >
> > tl;dr I’d like to get some thoughts on a proposal to use DB Manager for
> > S3 Dag Bundle versioning.
> >
> > The initial commit for S3 Dag Bundles was recently merged [1] but it
> > lacks Bundle versioning (since this isn’t trivial with something like
> > S3, like it is with Git). The proposed solution involves building a
> > snapshot of the S3 bucket at the time each Bundle version is created:
> > noting the version of all the objects in the bucket (using S3’s native
> > bucket versioning feature), creating a manifest to store those versions,
> > and then giving that whole manifest itself some unique id/version/uuid.
> > These manifests now need to be stored somewhere for future
> > use/retrieval. The proposal is to use the Airflow database via the DB
> > Manager feature.
> > Other options include using the local filesystem to store them (but
> > this obviously won't work in Airflow’s distributed architecture) or the
> > S3 bucket itself (but this requires write access to the bucket, and we
> > will always be at the mercy of the user accidentally deleting/modifying
> > the manifests as they try to manage the lifecycle of their bucket; they
> > should not need to be aware of, or need to account for, this metadata).
> > So the Airflow DB works nicely as a persistent and internally
> > accessible location for this data.
> >
> > But I’m aware of the complexities of using the DB Manager and the
> > discussion we had during the last dev call about providers vending DB
> > tables (concerning migrations and ensuring smooth upgrades or
> > downgrades of the schema). So I wanted to reach out to see what folks
> > thought. I have talked to Jed, the Bundle Master (tm), and we haven’t
> > come up with anything else that solves the problem as cleanly, so the
> > DB Manager is still my top choice. I think what we go with will pave
> > the way for other Bundle providers of a similar type as well, so it's
> > worth thinking deeply about this decision.
> >
> > Let me know what you think and thanks for your time!
> >
> > Cheers,
> > Niko
> >
> > [1] https://github.com/apache/airflow/pull/46621
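The snapshot/manifest idea described above can be sketched roughly as follows. This is a pure-Python illustration under stated assumptions: in a real implementation the key-to-VersionId mapping would be gathered from the bucket's versioning metadata (e.g. boto3's `list_object_versions`), and the function and field names below are invented for this sketch, not part of the merged S3 Bundle code.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_bundle_manifest(object_versions: dict) -> dict:
    """Freeze a bundle version as a manifest of S3 object version ids.

    `object_versions` maps object key -> S3 VersionId, as gathered from the
    bucket (which must have native versioning enabled) at snapshot time.
    """
    body = json.dumps(object_versions, sort_keys=True).encode()
    return {
        # Unique id for this bundle version. A content hash means an
        # unchanged bucket yields the same id; a uuid4 would also work
        # if every snapshot should be distinct.
        "manifest_id": hashlib.sha256(body).hexdigest()[:16],
        "created_at": datetime.now(timezone.utc).isoformat(),
        "objects": object_versions,
    }


manifest = build_bundle_manifest({
    "dags/etl.py": "3sL4kqtJlcpXroDTDmJ",  # fabricated VersionId values
    "dags/report.py": "null",              # "null" = object predates versioning
})
```

Checking out a given bundle version then reduces to fetching each key pinned to its recorded VersionId (with boto3, `get_object(Bucket=..., Key=..., VersionId=...)`), while the manifest rows themselves live wherever the thread lands on: the DB Manager tables, a separate state bucket, or a DynamoDB table.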