Thanks for engaging folks!

I don’t love the idea of using another bucket. For one, this means Airflow
needs write access to S3, which is not ideal; some users/customers are very
sensitive about ever allowing write access to anything. And two, you commonly
run into issues with designs that leak state into customer-managed
accounts/resources: users may delete the bucket not knowing what it is, or they
may not migrate it to a new account or region if they ever move. I think it’s
best for this data to be stored transparently to the user and co-located with
the data it strongly relates to (i.e. the dag runs that are associated with
those bundle versions).
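
For anyone who hasn’t dug into the proposal quoted below: the data in question
is tiny. Each bundle version is essentially a manifest mapping object keys to
the S3 VersionIds they had when the version was cut, built along these lines
(illustrative sketch only, not the actual code from the PR):

import boto3


def build_manifest(bucket: str, prefix: str = "") -> dict[str, str]:
    """Snapshot the bucket: map each object key to its current VersionId."""
    s3 = boto3.client("s3")
    manifest: dict[str, str] = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            if version["IsLatest"]:  # only the live version of each key
                manifest[version["Key"]] = version["VersionId"]
    return manifest

The open question is purely where to persist those manifests.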

Is using DB Manager completely unacceptable these days? What are folks' 
thoughts on that?
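
For what it’s worth, the table itself would be trivial - something along these
lines (names are made up, and the registration/migration wiring would go
through whatever mechanism the DB Manager feature requires):

from sqlalchemy import Column, DateTime, String, Text, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class S3BundleVersionManifest(Base):
    """Hypothetical provider-vended table: one row per S3 bundle version."""

    __tablename__ = "s3_dag_bundle_version_manifest"

    version_id = Column(String(36), primary_key=True)  # uuid handed out as the bundle version
    bundle_name = Column(String(250), nullable=False, index=True)
    manifest = Column(Text, nullable=False)  # JSON: object key -> S3 VersionId
    created_at = Column(DateTime, server_default=func.now(), nullable=False)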

Cheers,
Niko

________________________________
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Wednesday, July 9, 2025 6:23:54 AM
To: dev@airflow.apache.org
Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager

> Another option would also be using a DynamoDB table? That also supports
> snapshots and I feel it works very well with state management.

Yep that would also work.

Anything "Amazon" to keep state would do. I think it should be our
"default" approach that if we have to keep state, and that state is connected
with a specific provider's implementation, it's best not to keep the state in
Airflow, but in the "integration" the provider works with, if possible.
We cannot do that in the "generic" case because we do not know what
"integrations" the user has - but since this is the provider's functionality,
using whatever the given integration provides makes perfect sense.
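
To make the DynamoDB variant concrete, it would be little more than this
(table name and key schema are placeholders, not a settled design):

import json
import uuid

import boto3

table = boto3.resource("dynamodb").Table("airflow_s3_bundle_versions")  # hypothetical table


def save_manifest(bundle_name: str, manifest: dict) -> str:
    """Store one bundle-version manifest and return the generated version id."""
    version_id = str(uuid.uuid4())
    table.put_item(
        Item={
            "bundle_name": bundle_name,  # partition key (assumed)
            "version_id": version_id,    # sort key (assumed)
            "manifest": json.dumps(manifest),
        }
    )
    return version_id


def load_manifest(bundle_name: str, version_id: str) -> dict:
    item = table.get_item(Key={"bundle_name": bundle_name, "version_id": version_id})["Item"]
    return json.loads(item["manifest"])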

J.


On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu <gopidesupa...@gmail.com>
wrote:

> Agree, another S3 bucket also works here
>
> Another option would also be using a DynamoDB table? That also supports
> snapshots and I feel it works very well with state management.
>
>
> Pavan
>
> On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > One of the options would be to use a similar approach as terraform uses -
> > i.e. use dedicated "metadata" state storage in a DIFFERENT s3 bucket than
> > DAG files. Since we know there must be an S3 available (obviously) - it
> > seems not too excessive to assume that there might be another bucket,
> > independent of the DAG bucket where the state is stored - same bucket (and
> > dedicated connection id) could even be used to store state for multiple S3
> > dag bundles - each Dag bundle could have a dedicated object storing the
> > state. The metadata is not huge, so continuously reading and replacing it
> > should not be an issue.
> >
> > What's nice about it - this single object could even **actually** use S3
> > versioning to keep historical state - to optimize things and keep a log of
> > changes potentially.
> >
> > J.
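
For concreteness, the dedicated-state-bucket idea above boils down to roughly
this (bucket and key names are placeholders, and it assumes versioning is
enabled on the metadata bucket):

import json
from typing import Optional

import boto3

s3 = boto3.client("s3")
STATE_BUCKET = "airflow-bundle-state"             # hypothetical, separate from the DAG bucket
STATE_KEY = "bundles/my-s3-bundle/manifest.json"  # one state object per DAG bundle


def write_state(manifest: dict) -> str:
    """Overwrite the state object; S3 versioning keeps the history."""
    resp = s3.put_object(Bucket=STATE_BUCKET, Key=STATE_KEY, Body=json.dumps(manifest))
    return resp["VersionId"]  # could double as the bundle version id


def read_state(version_id: Optional[str] = None) -> dict:
    kwargs = {"Bucket": STATE_BUCKET, "Key": STATE_KEY}
    if version_id:
        kwargs["VersionId"] = version_id
    return json.loads(s3.get_object(**kwargs)["Body"].read())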
> >
> > On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid>
> > wrote:
> >
> > > Hey folks,
> > >
> > > tl;dr I’d like to get some thoughts on a proposal to use DB Manager for
> > > S3 Dag Bundle versioning.
> > >
> > > The initial commit for S3 Dag Bundles was recently merged [1] but it
> > > lacks Bundle versioning (since this isn’t trivial with something like
> > > S3, the way it is with Git). The proposed solution involves building a
> > > snapshot of the S3 bucket at the time each Bundle version is created,
> > > noting the version of all the objects in the bucket (using S3’s native
> > > bucket versioning feature), creating a manifest to store those versions,
> > > and then giving that whole manifest itself some unique id/version/uuid.
> > > These manifests now need to be stored somewhere for future
> > > use/retrieval. The proposal is to use the Airflow database via the DB
> > > Manager feature. Other options include using the local filesystem to
> > > store them (but this obviously won’t work in Airflow’s distributed
> > > architecture) or the S3 bucket itself (but this requires write access to
> > > the bucket, and we will always be at the mercy of the user accidentally
> > > deleting/modifying the manifests as they manage the lifecycle of their
> > > bucket; they should not need to be aware of, or account for, this
> > > metadata). So the Airflow DB works nicely as a persistent and internally
> > > accessible location for this data.
> > >
> > > But I’m aware of the complexities of using the DB Manager and the
> > > discussion we had during the last dev call about providers vending DB
> > > tables (concerning migrations and ensuring smooth upgrades or downgrades
> > > of the schema). So I wanted to reach out to see what folks thought. I
> > > have talked to Jed, the Bundle Master (tm), and we haven’t come up with
> > > anything else that solves the problem as cleanly, so the DB Manager is
> > > still my top choice. I think what we go with will pave the way for other
> > > Bundle providers of a similar type as well, so it's worth thinking
> > > deeply about this decision.
> > >
> > > Let me know what you think and thanks for your time!
> > >
> > > Cheers,
> > > Niko
> > >
> > > [1] https://github.com/apache/airflow/pull/46621
> > >
> >
>
