What about the DynamoDB idea? What you are really trading off is "writing to the Airflow metadata DB" vs. "writing to another DB". So yes - it is one more thing you will need write access to, other than the Airflow DB - but the real question is where the boundary should be: "everything writable should be in Airflow" vs. "everything writable should be in the cloud that the integration is about".
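For illustration only - a minimal sketch of what the DynamoDB variant could look like. The table name, key schema and helper functions below are all made up to show the shape of it, not an existing API:

import json

import boto3

# Hypothetical table with partition key "bundle_name" and sort key
# "version_id" - it would have to be created by the provider's tooling.
table = boto3.resource("dynamodb").Table("airflow_s3_bundle_manifests")


def save_manifest(bundle_name: str, version_id: str, manifest: dict) -> None:
    # The manifest maps S3 object keys to the object VersionIds that
    # make up this bundle version.
    table.put_item(
        Item={
            "bundle_name": bundle_name,
            "version_id": version_id,
            "manifest": json.dumps(manifest),
        }
    )


def load_manifest(bundle_name: str, version_id: str) -> dict:
    item = table.get_item(
        Key={"bundle_name": bundle_name, "version_id": version_id}
    )["Item"]
    return json.loads(item["manifest"])

Nothing Airflow-specific is needed there - which is rather the point.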
Yes - it makes the management using S3 versioning a bit more "write-y" - but on the other hand it allows us to confine the complexity to a pure "amazon" provider - with practically zero impact on Airflow core and the Airflow DB. Which I really like, to be honest. And yes - "co-location" is also my goal. And I think this is also a perfect way to explain why it is better to keep "S3 versioning" close to "S3" and not to Airflow - especially since there will be a lot of "S3-specific" things in the state that are not easy to abstract and make "common" for other Airflow versioning implementations.

You can think about it this way: Airflow has already done its job with abstractions - versioning changes and their metadata are implemented in the Airflow DB. If there are any missing pieces in the abstraction that would be usable across multiple implementations of versioning, we should - of course - add them to the Airflow metadata DB, in a way that they can be used by those different implementations. But the code to manage and use them should be in airflow-core. If there is anything specific to the S3 / Amazon integration -> it should be implemented independently of the Airflow metadata DB. There are many complexities in managing and upgrading the core DB, and we should not use it for provider-specific things.

The discussion about shared code and isolation is interesting in this context, because I think, as we go deeper and deeper in this direction, we might get to the point (and we are already more or less there) where NO (regular) providers are needed by whatever CLI or tooling we use to manage the metadata DB. FAB and Edge are currently exceptions - but they are by no means "regular" providers.

So I'd say - if, while designing/implementing S3 versioning, you see that part of the implementation can be abstracted away, added to the core and used by other implementations - 100% - let's add it to the core. But only then. If it is something that only the Amazon provider and S3 need - let's make it use Amazon **whatever** as backing storage. I would even say - talk to the Google team and try to come up with an abstraction that can be used for versioning in both S3 and GCS, agree on it, and let's see if this abstraction should find its way to the core.
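Just to make "use Amazon **whatever** as backing storage" concrete - a rough sketch of the pure-S3 variant discussed below in the thread: build the manifest from the DAG bucket's object versions, and keep it in a single object in a versioned state bucket, so S3 versioning itself gives us the history. The bucket names and key layout are made up:

import json

import boto3

s3 = boto3.client("s3")

# Snapshot the DAG bucket: record the current VersionId of every object.
manifest = {}
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket="my-dag-bucket"):
    for version in page.get("Versions", []):
        if version["IsLatest"]:
            manifest[version["Key"]] = version["VersionId"]

# Write the manifest to a single object in a separate state bucket that
# has versioning enabled - the VersionId S3 returns becomes the bundle
# version.
response = s3.put_object(
    Bucket="my-bundle-state-bucket",
    Key="bundles/my-bundle/manifest.json",
    Body=json.dumps(manifest).encode("utf-8"),
)
bundle_version = response["VersionId"]

# Retrieving a historical bundle version later is a single GetObject.
old = s3.get_object(
    Bucket="my-bundle-state-bucket",
    Key="bundles/my-bundle/manifest.json",
    VersionId=bundle_version,
)
old_manifest = json.loads(old["Body"].read())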
That would be my proposal.

J.

On Wed, Jul 9, 2025 at 7:37 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:

> Thanks for engaging, folks!
>
> I don't love the idea of using another bucket. For one, this means Airflow needs write access to S3, which is not ideal; some users/customers are very sensitive about ever allowing write access to things. And two, you will commonly get issues with a design that leaks state into customer-managed accounts/resources: they may delete the bucket not knowing what it is, or they may not migrate it to a new account or region if they ever move. I think it's best for the data to be stored transparently to the user and co-located with the data it strongly relates to (i.e. the dag runs that are associated with those bundle versions).
>
> Is using DB Manager completely unacceptable these days? What are folks' thoughts on that?
>
> Cheers,
> Niko
>
> ________________________________
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Wednesday, July 9, 2025 6:23:54 AM
> To: dev@airflow.apache.org
> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
>
> > Another option would also be using a DynamoDB table? That also supports
> > snapshots, and I feel it works very well with state management.
>
> Yep, that would also work.
>
> Anything "Amazon" to keep state would do. I think it should be our "default" approach that if we have to keep state, and the state is connected with a specific provider's implementation, it's best not to keep the state in Airflow, but in the "integration" that the provider works with, if possible. We cannot do it in the "generic" case, because we do not know what "integrations" the user has - but since this is the provider's functionality, using anything else that the given integration provides makes perfect sense.
>
> J.
>
>
> On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
>
> > Agree, another S3 bucket also works here.
> >
> > Another option would also be using a DynamoDB table? That also supports
> > snapshots, and I feel it works very well with state management.
> >
> > Pavan
> >
> > On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > One of the options would be to use a similar approach as terraform uses - i.e. use a dedicated "metadata" state storage in a DIFFERENT S3 bucket than the DAG files. Since we know there must be an S3 available (obviously), it seems not too excessive to assume that there might be another bucket, independent of the DAG bucket, where the state is stored - the same bucket (and a dedicated connection id) could even be used to store state for multiple S3 dag bundles - each dag bundle could have a dedicated object storing its state. The metadata is not huge, so continuously reading and replacing it should not be an issue.
> > >
> > > What's nice about it - this single object could even **actually** use S3 versioning to keep historical state - to optimize things and potentially keep a log of changes.
> > >
> > > J.
> > >
> > > On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > tl;dr I'd like to get some thoughts on a proposal to use DB Manager for S3 Dag Bundle versioning.
> > > >
> > > > The initial commit for S3 Dag Bundles was recently merged [1] but it lacks Bundle versioning (since this isn't trivial with something like S3, like it is with Git). The proposed solution involves building a snapshot of the S3 bucket at the time each Bundle version is created, noting the version of all the objects in the bucket (using S3's native bucket versioning feature), creating a manifest to store those versions, and then giving that whole manifest itself some unique id/version/uuid. These manifests now need to be stored somewhere for future use/retrieval. The proposal is to use the Airflow database using the DB Manager feature.
> > > > Other options include using the local filesystem to store them (but this obviously won't work in Airflow's distributed architecture) or the S3 bucket itself (but this requires write access to the bucket, and we will always be at the mercy of users accidentally deleting/modifying the manifests as they manage the lifecycle of their bucket; they should not need to be aware of, or account for, this metadata). So the Airflow DB works nicely as a persistent and internally accessible location for this data.
> > > >
> > > > But I'm aware of the complexities of using the DB Manager and the discussion we had during the last dev call about providers vending DB tables (concerning migrations and ensuring smooth upgrades or downgrades of the schema). So I wanted to reach out to see what folks thought. I have talked to Jed, the Bundle Master (tm), and we haven't come up with anything else that solves the problem as cleanly, so the DB Manager is still my top choice. I think what we go with will pave the way for other Bundle providers of a similar type as well, so it's worth thinking deeply about this decision.
> > > >
> > > > Let me know what you think, and thanks for your time!
> > > >
> > > > Cheers,
> > > > Niko
> > > >
> > > > [1] https://github.com/apache/airflow/pull/46621