Sorry for the late response. Both approaches work for me; I just wanted to share my opinion as we settle on a final decision.
From my perspective, the DagBundle acts as a client that pulls external state and stores only the version identifier in the Airflow metadata DB. For example, with GitDagBundle, the Git repository serves as the external storage: the GitDagBundle pulls the DAG files locally and stores the commit hash in `DagBundleModel.version`.

1. If we choose to store the manifest in the Airflow metadata DB:

In my opinion, we can simply add an optional `manifest` field (or another suitable name). I don’t think we need to introduce a new table via DbManager; an additional field for storing metadata about the external state (such as the prefix and the object versions of all DAGs in the bundle, in the case of S3DagBundle) should suffice. We could introduce a new base class, such as `RemoteDagBundle` or `ObjectStoreDagBundle`, in the common provider to define the structure for serializing and deserializing the `manifest` field.

2. If we decide to store the manifest outside the Airflow metadata DB:

We will need to clarify:

a) The required parameters for all DagBundles that pull DAGs from object storage. Based on the discussion above, we would need the `conn_id`, `bucket`, and `prefix` for the manifest file.

b) The interface for calculating the bundle version based on the external state or a DAG content hash.

Here is a concrete example of how the manifest could be stored:
https://github.com/apache/airflow/pull/46621#issuecomment-3078208467

I have also appended a rough, purely illustrative sketch of both options at the very bottom of this mail, below the quoted thread.

Thank you all for the insightful discussion!

Best,
Jason

On 2025/07/10 21:56:31 "Oliveira, Niko" wrote:
> Thanks for the reply Jarek :)
>
> Indeed we have different philosophies about this so we will certainly keep going in circles about where to draw the line on making things easy and enjoyable to use, whether to intentionally add friction or not, etc, etc.
>
> I think if we have optional paths to take and it's not immensely harder we should err on the side of making OSS Airflow as good as it can be, despite whatever managed services we have in the community. I'm not sure where it has come from recently but this new push to make Airflow intentionally hard to use so that managed services stay in business is a bit unsettling. We're certainly not asking for that, and those around that I've chatted to (since I'm now seeing this mentioned frequently) are also not asking for this. I'm curious where this new pressure is coming from and why you feel it recently.
>
> But regardless of the curiosity above, I'll return to the drawing board, and see what else can be done for this particular problem. If there are other Bundle types who need to solve the same problem perhaps we can find a more acceptable implementation in Airflow core to support this. And if not, I'll proceed with externalizing the storage of the S3 Bundle version metadata outside of Airflow.
>
> Cheers,
> Niko
>
> ________________________________
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Wednesday, July 9, 2025 11:59:06 PM
> To: dev@airflow.apache.org
> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que le contenu ne présente aucun risque.
> > > > > To me, I'm always working from a user perspective. My goal is to make > their lives easier, their deployments easier, the product the most > enjoyable for them to use. To me, the best user experience is that they > should enable bundle versioning and it should just work with as little or > no extra steps and with as little infra as possible, and with the fewest > possible pit falls for them to fall into. From a user perspective, they've > already provisioned a database for airflow metadata, why is this portion of > metadata leaking out to other forms of external storage? Now this is > another resource they now need to be aware of and manage the lifecycle of > (or allow us write access into their accounts to manage for them). > > > *TL;DR; I think our goal in open-source is to have frictionless and "out of > the box" experience only for basic cases, but not for more complex > deployments.* > > It's a long read if you want to read it .. so beware :). > > I think that is an important "optimization goal" for sure to provide > frictionless and enjoyable experience - but I think it's one of many goals > that are sometimes contradicting with long term open-source project > sustainability and it's very import to clarify which "user" we are talking > about. > > To be honest, I am not sure that our goal should be "airflow should work > out of the box in case of integration with external services in production' > if it complicates our code and makes it service-dependent - and as Jens > noticed, if we can come up with a "generic" thing that can be reusable > across multiple services, we can invest more in making it works "out of the > box", but if you anyhow need to integrate and make work with external > service, it adds very little "deployment complexity" to use another piece > of the service - and this is basically the job of deployment manager > anyway. > > The "just work" goal as I see it should only cover those individual users > who want to try and use airflow in it's basic form and "standalone" > configuration - not for "deployment managers". > > I think yes - our goal should be to make things extremely easy for users > who want to use airflow in its basic form where things should **just > work**. Like "docker run -it apache/airflow standalone" - this is what > currently **just works**, 0 configuration, 0 work for external > integrations, and we even had a discussion that we could make it "low > production ready" (which I think we could - just implement automated > backup/recovery of sqlite db and maybe document mounting a folder with DAGs > and db, better handling of logs rather than putting them as mixed output on > stdout and we are practically done). But when you add "S3" as the dag > storage you already need to make a lot of decisions - mostly about service > accounts, security, access, versioning, backup of the s3 objects, etc. etc. > And that's not a "standalone user' case - that is a "deployment manager" > work (where "deployment manager" is a role - not necessarily title of the > job you have. > > I think - and that is a bit of philosophical - but I've been talking about > it to Maciek Obuchowski yesterday - that there is a pretty clear boundary > of what open-source solutions delivers and it should match expectations of > people using it. 
Maintainers and community developing open-source should > mostly deliver a working, generic solutions that are extendable with > various deployment options and we should make it possible for those > deployments to happen - and provide building blocks for them. But it's > "deployment manager" work to make sure to put things together and make it > works. And we should not do it "for them". It's their job to figure out how > to configure and set-up things, make backups, set security boundaries etc. > - we should make it possible, document the options, document security model > and make it "easy" to configure things - but there should not be an > expectation from the deploiyment manager that it "just works". > > And I think your approach is perfectly fine - but only for "managed > services" - there, indeed manage service user's expectations can be that > things "just work" and they are willing to pay for it with real money, > rather than their time and effort to make it so. And there I think, those > who deliver such a service should have the "just work" as primary goal - > also because users will have such expectations - because they actually pay > for it to "just work". Not so much for open-source product - where "just > work" often involves complexity, additional maintenance overhead and making > opinionated decisions on "how it just works". For those "managed service" > teams - "just work" is very much a primary goal. But for "open source > community" - having such a goal is actually not good - it's dangerous > because it might result in wrong expectations from the users. If we start > making airflow "just works" in all kinds of deployment with zero work from > the users who want to deploy it in production and at scale, they will > expect it to happen for everything - why don't we have automated log > trimming, why don't we have automated backup of the Database, why don't we > auto vacuum the db, why don't we provide one-click deployment option on > AWS. GCS. Azure, why don't we provide DDOS protection in our webserver, why > don't we ..... you name it. > > That's a bit of philosophy - those are the same assumptions and goals that > I had in mind when designing multi-team - and there it's also why we had > different views - I just feel that some level of friction is a "property" > of open-source product. > > Also a bit of "business" side - this is also "good" for those who provide > managed services and airflow to keep sustainable open-source business model > working - because what people are paying them is precisely to "remove the > friction". If take the "frictionless user experience" goal case to extreme > - Airflow would essentially be killed IMHO. Imagine if Airflow would be > frictioness for all kinds of deployments and had "everything" working out > of the box. There would be no business for any of the managed services > (because users would not need to pay for it). Then we would only have users > who expect thigns to "just work" and most of them would not even think > about contributing back. And there would be no managed services people > (like you) whose job is paid by the services - or people like me who work > with and get money from several of those - which would basically slow down > active development and maintenance for Airflow to a halt - because even if > we had a lot of people willing to contribute, maintainers would have very > little - own - time to keep things running. 
There is a fine balance that we > keep now between the open-source and stakeholders, and open-source product > "friction" is an important property that the balance is built on. > > J. > > > On Wed, Jul 9, 2025 at 9:21 PM Oliveira, Niko <oniko...@amazon.com.invalid> > wrote: > > > To me, I'm always working from a user perspective. My goal is to make > > their lives easier, their deployments easier, the product the most > > enjoyable for them to use. To me, the best user experience is that they > > should enable bundle versioning and it should just work with as little or > > no extra steps and with as little infra as possible, and with the fewest > > possible pit falls for them to fall into. From a user perspective, they've > > already provisioned a database for airflow metadata, why is this portion of > > metadata leaking out to other forms of external storage? Now this is > > another resource they now need to be aware of and manage the lifecycle of > > (or allow us write access into their accounts to manage for them). > > > > Ultimately, we should not be afraid of doing sometimes difficult work to > > make a good product for our users, it's for them in the end :) > > > > However, I see your perspectives as well, making our code and DB > > management more complex is more work and complication for us. And from the > > feedback so far I'm out voted, so I'm happy as always to disagree and > > commit, and do as you wish :) > > > > Thanks for the feedback everyone! > > > > Cheers, > > Niko > > > > ________________________________ > > From: Jens Scheffler <j_scheff...@gmx.de.INVALID> > > Sent: Wednesday, July 9, 2025 12:07:08 PM > > To: dev@airflow.apache.org > > Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager > > > > CAUTION: This email originated from outside of the organization. Do not > > click links or open attachments unless you can confirm the sender and know > > the content is safe. > > > > > > > > AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe. > > Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez > > pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que > > le contenu ne présente aucun risque. > > > > > > > > My 2ct on the discussions are similar like the opinions before. > > > > From my Edge3 experience migrating DB from provider - even if > > technically enabled - is a bit of a pain. Adding a lot of boilerplate, > > you need to consider your provider should also still be compatible with > > AF2 (I assume) and once a user wants to downgrade it is a bit of manual > > effort to downgrade DB as well. > > > > As long as we are not adding a generic Key/Value store to core (similar > > liek Variables but for general purpose internal use not exposed to users > > - but then in case of trougbleshooting how to "manage/admin it?) I would > > also see it like Terraform - a secondary bucked for state os cheap and > > convenient. Yes write access would be needed but only for Airflow. And > > as it is separated from other should not be a general security harm... > > just a small deployment complexity. And I assume versining is optional. > > So no requirement to have it on per default and if a user wants to move > > to/enable versioing then just the state bucket would need to be added to > > Bundle-config? > > > > TLDR I would favor a bucket, else if DB is the choice then a common > > solution in core might be easier than a DB handling in provider. 
But > > would also not block any other, just from point of complexity I'd not > > favor provider specifc DB tables. > > > > Jens > > > > On 09.07.25 19:57, Jarek Potiuk wrote: > > > What about the DynamoDB idea ? What you are trying to trade-off is > > "writing > > > to airflow metadata DB" with "writing to another DB" really. So yes it > > is - > > > another thing you will need to have access to write to - other than > > Airflow > > > DB, but it's really the question should the boundaries be on "Everything > > > writable should be in Airflow" vs. "Everything writable should be in the > > > "cloud" that the integration is about. > > > > > > Yes - it makes the management using S3 versioning a bit more "write-y" - > > > but on the other hand it does allow to confine complexity to a pure > > > "amazon" provider - with practically 0 impact on Airflow core and > > airflow > > > DB. Which I really like to be honest. > > > > > > And yes "co-location" is also my goal. And I think this is a perfect way > > to > > > explain it as well why it is better to keep "S3 versioning" close to "S3" > > > and not to Airflow - especially that there will be a lot of "S3-specific" > > > things in the state that are not easy to abstract and have "common" for > > > other Airflow versioning implementations. > > > > > > You can think about it this way: > > > > > > Airflow has already done its job with abstractions - versioning changes > > and > > > metadata DB is implemented in Airflow DB. If there are any missing pieces > > > in the abstraction that will be usable across multiple implementations of > > > versioning, we should - of course - add it to Airflow metadata DB - in > > the > > > way that they can be used by those different implementations. But the > > code > > > to manage and use it should be in airflow-core. > > > If there is anything specific for the implementation of S3 / Amazon > > > integration -> it should be implemented independently from Airflow > > Metadata > > > DB. There are many complexities in managing and upgrading core DB and we > > > should not use the db to make provider-specific things. The discussion > > > about shared code and isolation is interesting in this context. Because I > > > think we might get to the point when we go deeper and deeper in this > > > direction that we will have (and we already do it more or less) NO > > > (regular) providers needed with whatever CLI or tooling we will need to > > > manage the Metadata DB. FAB and Edge are currently exceptions - but they > > > are by no means "regular" providers. > > > > > > So I'd say - if while designing/ implementing S3 versioning you will see > > > that part of the implementation can be abstracted away and added to the > > > core and used by other implementations - 100% - let's add it to the core. > > > But only then. If it is something that only Amazon provider needs and S3 > > > needs - let's make it use Amazon **whatever** as backing storage. > > > > > > I would even say - talk to the Google team and try to come up with an > > > abstraction that can be used for versioning in both S3 and GCS, agree on > > > it, and let's see if this abstraction should find its way to the core. > > That > > > would be my proposal. > > > > > > J. > > > > > > > > > > > > > > > On Wed, Jul 9, 2025 at 7:37 PM Oliveira, Niko > > <oniko...@amazon.com.invalid> > > > wrote: > > > > > >> Thanks for engaging folks! > > >> > > >> I don’t love the idea of using another bucket. 
For one, this means > > Airflow > > >> needs write access to S3 which is not ideal; some users/customers are > > very > > >> sensitive about ever allowing write access to things. And two, you will > > >> commonly get issues with a design that leaks state into customer managed > > >> accounts/resources, they may delete the bucket not knowing what it is, > > they > > >> may not migrate it to a new account or region if they ever move. I think > > >> it’s best for the data to be stored transparently to the user and > > >> co-located with the data it strongly relates to (i.e. the dag runs that > > are > > >> associated with those bundle versions). > > >> > > >> Is using DB Manager completely unacceptable these days? What are folks' > > >> thoughts on that? > > >> > > >> Cheers, > > >> Niko > > >> > > >> ________________________________ > > >> From: Jarek Potiuk <ja...@potiuk.com> > > >> Sent: Wednesday, July 9, 2025 6:23:54 AM > > >> To: dev@airflow.apache.org > > >> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager > > >> > > >> CAUTION: This email originated from outside of the organization. Do not > > >> click links or open attachments unless you can confirm the sender and > > know > > >> the content is safe. > > >> > > >> > > >> > > >> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur > > externe. > > >> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne > > pouvez > > >> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain > > que > > >> le contenu ne présente aucun risque. > > >> > > >> > > >> > > >>> Another option also would be Using dynamodb table? that also supports > > >> snapshots and i feel it works very well with state management. > > >> > > >> Yep that would also work. > > >> > > >> Anything "Amazon" to keep state would do. I think that it should be our > > >> "default" approach that if we have to keep state and the state is > > connected > > >> with specific "provider's" implementation, it's best to not keep state > > in > > >> Airflow, but in the "integration" that the provider works with if > > possible. > > >> We cannot do it in "generic" case because we do not know what > > >> "integrations" the user has - but since this is "provider's" > > functionality, > > >> using anything else that the given integration provides makes perfect > > >> sense. > > >> > > >> J. > > >> > > >> > > >> On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu < > > >> gopidesupa...@gmail.com> > > >> wrote: > > >> > > >>> Agree another s3 bucket also works here > > >>> > > >>> Another option also would be Using dynamodb table? that also supports > > >>> snapshots and i feel it works very well with state management. > > >>> > > >>> > > >>> Pavan > > >>> > > >>> On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > >>> > > >>>> One of the options would be to use a similar approach as terraform > > >> uses - > > >>>> i.e. use dedicated "metadata" state storage in a DIFFERENT s3 bucket > > >> than > > >>>> DAG files. Since we know there must be an S3 available (obviously) - > > it > > >>>> seems not too excessive to assume that there might be another bucket, > > >>>> independent of the DAG bucket where the state is stored - same bucket > > >>> (and > > >>>> dedicated connection id) could even be used to store state for > > multiple > > >>> S3 > > >>>> dag bundles - each Dag bundle could have a dedicated object storing > > the > > >>>> state. 
The metadata is not huge, so continuously reading and replacing > > >> it > > >>>> should not be an issue. > > >>>> > > >>>> What's nice about it - this single object could even **actually** > > use > > >> S3 > > >>>> versioning to keep historical state - to optimize things and keep a > > >> log > > >>> of > > >>>> changes potentially. > > >>>> > > >>>> J. > > >>>> > > >>>> On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko > > >>> <oniko...@amazon.com.invalid > > >>>> wrote: > > >>>> > > >>>>> Hey folks, > > >>>>> > > >>>>> tl;dr I’d like to get some thoughts on a proposal to use DB Manager > > >> for > > >>>> S3 > > >>>>> Dag Bundle versioning. > > >>>>> > > >>>>> The initial commit for S3 Dag Bundles was recently merged [1] but it > > >>>> lacks > > >>>>> Bundle versioning (since this isn’t trivial with something like S3, > > >>> like > > >>>> it > > >>>>> is with Git). The proposed solution involves building a snapshot of > > >> the > > >>>> S3 > > >>>>> bucket at the time each Bundle version is created, noting the version > > >>> of > > >>>>> all the objects in the bucket (using S3’s native bucket versioning > > >>>> feature) > > >>>>> and creating a manifest to store those versions and then giving that > > >>>> whole > > >>>>> manifest itself some unique id/version/uuid. These manifests now need > > >>> to > > >>>> be > > >>>>> stored somewhere for future use/retrieval. The proposal is to use the > > >>>>> Airflow database using the DB Manager feature. Other options include > > >>>> using > > >>>>> the local filesystem to store them (but this obviously wont work in > > >>>>> Airflow’s distributed architecture) or the S3 bucket itself (but this > > >>>>> requires write access to the bucket and we will always be at the > > >> mercy > > >>> of > > >>>>> the user accidentally deleting/modifying the manifests as they try to > > >>>>> manage the lifecycle of their bucket, they should not need to be > > >> aware > > >>> of > > >>>>> or need to account for this metadata). So the Airflow DB works nicely > > >>> as > > >>>> a > > >>>>> persistent and internally accessible location for this data. > > >>>>> > > >>>>> But I’m aware of the complexities of using the DB Manager and the > > >>>>> discussion we had during the last dev call about providers vending DB > > >>>>> tables (concerning migrations and ensuring smooth upgrades or > > >>> downgrades > > >>>> of > > >>>>> the schema). So I wanted to reach out to see what folks thought. I > > >> have > > >>>>> talked to Jed, the Bundle Master (tm), and we haven’t come up with > > >>>> anything > > >>>>> else that solves the problem as cleanly, so the DB Manager is still > > >> my > > >>>> top > > >>>>> choice. I think what we go with will pave the way for other Bundle > > >>>>> providers of a similar type as well, so it's worth thinking deeply > > >>> about > > >>>>> this decision. > > >>>>> > > >>>>> Let me know what you think and thanks for your time! > > >>>>> > > >>>>> Cheers, > > >>>>> Niko > > >>>>> > > >>>>> [1] https://github.com/apache/airflow/pull/46621 > > >>>>> > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > > For additional commands, e-mail: dev-h...@airflow.apache.org > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org For additional commands, e-mail: dev-h...@airflow.apache.org
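P.S. To make the two options above a bit more concrete, here is a very rough, hypothetical sketch (appended here so it does not break up the quoted thread). None of these names exist in Airflow today: `RemoteDagBundle`, `S3DagBundleSketch`, `build_manifest`, `manifest_version`, the optional `DagBundleModel.manifest` column and the `bundle-state/` key layout are all placeholders for this discussion, and deriving the bundle version as a hash of the manifest is just one possible choice. Only the boto3 calls (`list_object_versions`, `put_object`) are existing APIs; everything else is illustrative.

    import hashlib
    import json

    import boto3


    class RemoteDagBundle:
        """Hypothetical common-provider base class: defines how a bundle's
        external state is serialized into the proposed optional `manifest`
        field and how the bundle version is derived from it."""

        def build_manifest(self) -> dict:
            raise NotImplementedError

        def serialize_manifest(self, manifest: dict) -> str:
            # Stable JSON so that the derived version is deterministic.
            return json.dumps(manifest, sort_keys=True)

        def manifest_version(self, manifest: dict) -> str:
            # One possible choice: the bundle version is a digest of the manifest.
            return hashlib.sha256(self.serialize_manifest(manifest).encode()).hexdigest()


    class S3DagBundleSketch(RemoteDagBundle):
        """Sketch of the S3 flavour; assumes S3 bucket versioning is enabled."""

        def __init__(self, bucket: str, prefix: str = "") -> None:
            self.bucket = bucket
            self.prefix = prefix
            self.s3 = boto3.client("s3")  # in Airflow this would come from the conn_id

        def build_manifest(self) -> dict:
            # Snapshot the current VersionId of every object under the prefix.
            objects: dict[str, str] = {}
            paginator = self.s3.get_paginator("list_object_versions")
            for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
                for version in page.get("Versions", []):
                    if version["IsLatest"]:
                        objects[version["Key"]] = version["VersionId"]
            return {"bucket": self.bucket, "prefix": self.prefix, "objects": objects}

        def store_manifest_externally(self, state_bucket: str, manifest: dict) -> str:
            # Option 2 flavour: keep the manifest in a separate "state" bucket
            # instead of an Airflow DB column; the object key embeds the version.
            version = self.manifest_version(manifest)
            key = f"bundle-state/{self.prefix or 'root'}/{version}.json"
            self.s3.put_object(
                Bucket=state_bucket,
                Key=key,
                Body=self.serialize_manifest(manifest).encode(),
            )
            return version


    bundle = S3DagBundleSketch(bucket="my-dag-bucket", prefix="dags/")
    manifest = bundle.build_manifest()
    version = bundle.manifest_version(manifest)
    # Option 1: persist serialize_manifest(manifest) in the proposed optional
    #           DagBundleModel.manifest field and `version` in DagBundleModel.version.
    # Option 2: bundle.store_manifest_externally("my-bundle-state-bucket", manifest)

With option 1, the core change would roughly be the new nullable `manifest` column plus the (de)serialization contract on the base class, and the provider never touches DbManager. With option 2, nothing new is stored in the metadata DB beyond the existing `DagBundleModel.version` string, at the cost of the extra state bucket (and the write access to it) that was discussed above.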