> To me, I'm always working from a user perspective. My goal is to make
> their lives easier, their deployments easier, the product the most
> enjoyable for them to use. To me, the best user experience is that they
> should enable bundle versioning and it should just work with as little or
> no extra steps and with as little infra as possible, and with the fewest
> possible pitfalls for them to fall into. From a user perspective, they've
> already provisioned a database for airflow metadata, why is this portion of
> metadata leaking out to other forms of external storage? Now this is
> another resource they now need to be aware of and manage the lifecycle of
> (or allow us write access into their accounts to manage for them).


*TL;DR: I think our goal in open source is to provide a frictionless, "out
of the box" experience only for basic cases, not for more complex
deployments.*

It's a long read, if you want to read it all... so beware :).

I think providing a frictionless and enjoyable experience is an important
"optimization goal" for sure - but it's one of many goals, some of which
conflict with long-term open-source project sustainability, and it's very
important to clarify which "user" we are talking about.

To be honest, I am not sure our goal should be "airflow should work out of
the box when integrating with external services in production" if it
complicates our code and makes it service-dependent. As Jens noticed, if we
can come up with a "generic" thing that is reusable across multiple
services, we can invest more in making it work "out of the box" - but if
you need to integrate with an external service anyway, using another piece
of that service adds very little "deployment complexity", and that is
basically the job of the deployment manager anyway.
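
To make "use another piece of the service" concrete - here is a rough
sketch of the terraform-like "state object" idea that came up earlier in
this thread. The bucket and key names are made up and it assumes boto3; it
is meant to show the shape of the thing, not to be a real implementation:

    import json
    from typing import Optional

    import boto3  # assumed to be available, as in the Amazon provider

    # Hypothetical names - a state bucket separate from the DAG bucket,
    # with one state object per dag bundle.
    STATE_BUCKET = "my-airflow-bundle-state"
    STATE_KEY = "s3-dag-bundle/my-bundle/state.json"

    s3 = boto3.client("s3")

    def save_bundle_state(manifest: dict) -> str:
        """Write the bundle-version manifest for this bundle."""
        resp = s3.put_object(
            Bucket=STATE_BUCKET,
            Key=STATE_KEY,
            Body=json.dumps(manifest).encode("utf-8"),
        )
        # VersionId is present when bucket versioning is enabled.
        return resp["VersionId"]

    def load_bundle_state(version_id: Optional[str] = None) -> dict:
        """Read the latest manifest, or a specific historical one."""
        kwargs = {"Bucket": STATE_BUCKET, "Key": STATE_KEY}
        if version_id is not None:
            kwargs["VersionId"] = version_id
        resp = s3.get_object(**kwargs)
        return json.loads(resp["Body"].read())

With versioning enabled on the state bucket, every write keeps the previous
manifest around as an older object version, so you get a history of state
changes "for free" - and all of it stays inside the "amazon" integration,
with zero footprint in the Airflow metadata DB.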

The "just work" goal as I see it should only cover those individual users
who want to try and use airflow in it's basic form and "standalone"
configuration - not for "deployment managers".

I think yes - our goal should be to make things extremely easy for users
who want to use airflow in its basic form, where things should **just
work**. Like "docker run -it apache/airflow standalone" - this is what
currently **just works**: 0 configuration, 0 work for external
integrations. We even had a discussion that we could make it "low
production ready" - which I think we could: just implement automated
backup/recovery of the sqlite db (a sketch of that piece is below), maybe
document mounting a folder with the DAGs and the db, and handle logs better
rather than putting them as mixed output on stdout - and we are practically
done. But when you add "S3" as the dag storage, you already need to make a
lot of decisions - mostly about service accounts, security, access,
versioning, backup of the s3 objects, etc. And that's not a "standalone
user" case - that is "deployment manager" work (where "deployment manager"
is a role, not necessarily the title of the job you have).
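
And to show how little is missing for that "low production ready"
standalone case - a minimal sketch of the automated sqlite backup/recovery
piece, assuming the standalone default of ~/airflow/airflow.db (the backup
folder and file naming are made up):

    import shutil
    import sqlite3
    import time
    from pathlib import Path

    # Assumed locations: the standalone default db path, plus a
    # hypothetical backup folder next to it.
    DB_PATH = Path.home() / "airflow" / "airflow.db"
    BACKUP_DIR = Path.home() / "airflow" / "backups"

    def backup_db() -> Path:
        """Take a consistent online snapshot via sqlite's backup API."""
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        target = BACKUP_DIR / f"airflow-{time.strftime('%Y%m%dT%H%M%S')}.db"
        src = sqlite3.connect(DB_PATH)
        dst = sqlite3.connect(target)
        try:
            src.backup(dst)  # consistent even while airflow writes to the db
        finally:
            dst.close()
            src.close()
        return target

    def restore_db(snapshot: Path) -> None:
        """Restore a snapshot - only do this while airflow is stopped."""
        shutil.copyfile(snapshot, DB_PATH)

Run backup_db() from cron (or a tiny sidecar loop) and the standalone case
stays "zero infra" - no external service needed at all.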

I think - and this is a bit philosophical, but I talked about it with
Maciek Obuchowski yesterday - that there is a pretty clear boundary around
what open-source solutions deliver, and it should match the expectations of
the people using them. Maintainers and the community developing open source
should mostly deliver working, generic solutions that are extendable with
various deployment options - we should make it possible for those
deployments to happen and provide the building blocks for them. But it is
the "deployment manager's" work to put things together and make them work.
And we should not do it "for them". It's their job to figure out how to
configure and set up things, make backups, set security boundaries, etc. We
should make it possible, document the options, document the security model
and make it "easy" to configure things - but there should be no expectation
from the deployment manager that it "just works".

And I think your approach is perfectly fine - but only for "managed
services". There, indeed, managed service users' expectation can be that
things "just work", and they are willing to pay for it with real money
rather than with their own time and effort. And there, I think, those who
deliver such a service should have "just works" as a primary goal - also
because users will have such expectations, because they actually pay for it
to "just work". Not so much for an open-source product - where "just works"
often involves complexity, additional maintenance overhead, and making
opinionated decisions on "how it just works". For those "managed service"
teams, "just works" is very much a primary goal. But for the "open source
community", having such a goal is actually not good - it's dangerous,
because it might result in wrong expectations from users. If we start
making airflow "just work" in all kinds of deployments, with zero work from
the users who want to deploy it in production and at scale, they will
expect it to happen for everything - why don't we have automated log
trimming, why don't we have automated backup of the database, why don't we
auto-vacuum the db, why don't we provide a one-click deployment option on
AWS, GCP, Azure, why don't we provide DDoS protection in our webserver, why
don't we ..... you name it.

That's a bit of philosophy - but those are the same assumptions and goals
that I had in mind when designing multi-team, and that is also why we had
different views there - I just feel that some level of friction is a
"property" of an open-source product.

Also a bit of "business" side - this is also "good" for those who provide
managed services and airflow to keep sustainable open-source business model
working - because what people are paying them is precisely to "remove the
friction".  If take the "frictionless user experience" goal case to extreme
- Airflow would essentially be killed IMHO. Imagine if Airflow would be
frictioness for all kinds of deployments and had "everything" working out
of the box. There would be no business for any of the managed services
(because users would not need to pay for it). Then we would only have users
who expect thigns to "just work" and most of them would not even think
about contributing back. And there would be no managed services people
(like you)  whose job is paid by the services - or people like me who work
with and get money from several of those - which would basically slow down
active development and maintenance for Airflow to a halt - because even if
we had a lot of people willing to contribute, maintainers would have very
little - own - time to keep things running. There is a fine balance that we
keep now between the open-source and stakeholders, and open-source product
"friction" is an important property that the balance is built on.

J.


On Wed, Jul 9, 2025 at 9:21 PM Oliveira, Niko <oniko...@amazon.com.invalid>
wrote:

> To me, I'm always working from a user perspective. My goal is to make
> their lives easier, their deployments easier, the product the most
> enjoyable for them to use. To me, the best user experience is that they
> should enable bundle versioning and it should just work with as little or
> no extra steps and with as little infra as possible, and with the fewest
> possible pitfalls for them to fall into. From a user perspective, they've
> already provisioned a database for airflow metadata, why is this portion of
> metadata leaking out to other forms of external storage? Now this is
> another resource they now need to be aware of and manage the lifecycle of
> (or allow us write access into their accounts to manage for them).
>
> Ultimately, we should not be afraid of sometimes doing difficult work to
> make a good product for our users - it's for them in the end :)
>
> However, I see your perspectives as well - making our code and DB
> management more complex is more work and complication for us. And from the
> feedback so far I'm outvoted, so I'm happy as always to disagree and
> commit, and do as you wish :)
>
> Thanks for the feedback everyone!
>
> Cheers,
> Niko
>
> ________________________________
> From: Jens Scheffler <j_scheff...@gmx.de.INVALID>
> Sent: Wednesday, July 9, 2025 12:07:08 PM
> To: dev@airflow.apache.org
> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
>
> My 2ct on the discussion is similar to the opinions before.
>
> From my Edge3 experience, migrating a DB from a provider - even if
> technically enabled - is a bit of a pain. It adds a lot of boilerplate,
> you need to consider that your provider should also still be compatible
> with AF2 (I assume), and once a user wants to downgrade, it is a bit of
> manual effort to downgrade the DB as well.
>
> As long as we are not adding a generic key/value store to core (similar
> to Variables, but for general-purpose internal use, not exposed to users
> - but then, in case of troubleshooting, how would you manage/admin it?),
> I would also see it like Terraform - a secondary bucket for state is
> cheap and convenient. Yes, write access would be needed, but only for
> Airflow. And as it is separated from everything else, it should not be a
> general security harm... just a small deployment complexity. And I assume
> versioning is optional, so there is no requirement to have it on by
> default - and if a user wants to move to/enable versioning, then just the
> state bucket would need to be added to the bundle config?
>
> TL;DR: I would favor a bucket; else, if the DB is the choice, then a
> common solution in core might be easier than DB handling in a provider.
> But I would also not block any other option - just from the point of view
> of complexity, I'd not favor provider-specific DB tables.
>
> Jens
>
> On 09.07.25 19:57, Jarek Potiuk wrote:
> > What about the DynamoDB idea? What you are trying to trade off is "writing
> > to the airflow metadata DB" vs. "writing to another DB", really. So yes, it
> > is another thing you will need write access to - other than the Airflow DB
> > - but it's really a question of whether the boundary should be "everything
> > writable should be in Airflow" vs. "everything writable should be in the
> > 'cloud' that the integration is about".
> >
> > Yes - it makes the management using S3 versioning a bit more "write-y" -
> > but on the other hand it allows us to confine the complexity to a pure
> > "amazon" provider - with practically 0 impact on Airflow core and the
> > airflow DB. Which I really like, to be honest.
> >
> > And yes "co-location" is also my goal. And I think this is a perfect way
> to
> > explain it as well why it is better to keep "S3 versioning" close to "S3"
> > and not to Airflow - especially that there will be a lot of "S3-specific"
> > things in the state that are not easy to abstract and have "common" for
> > other Airflow versioning implementations.
> >
> > You can think about it this way:
> >
> > Airflow has already done its job with abstractions - versioning changes
> > and metadata are implemented in the Airflow DB. If there are any missing
> > pieces in the abstraction that will be usable across multiple
> > implementations of versioning, we should - of course - add them to the
> > Airflow metadata DB, in a way that lets those different implementations
> > use them. But the code to manage and use them should be in airflow-core.
> > If there is anything specific to the implementation of the S3 / Amazon
> > integration -> it should be implemented independently from the Airflow
> > metadata DB. There are many complexities in managing and upgrading the
> > core DB, and we should not use that db for provider-specific things. The
> > discussion about shared code and isolation is interesting in this
> > context, because I think, as we go deeper and deeper in this direction,
> > we might get to the point where (and we are already there, more or less)
> > NO (regular) provider needs the metadata DB, whatever CLI or tooling we
> > will need to manage it. FAB and Edge are currently exceptions - but they
> > are by no means "regular" providers.
> >
> > So I'd say - if, while designing/implementing S3 versioning, you see that
> > part of the implementation can be abstracted away, added to the core, and
> > used by other implementations - 100% - let's add it to the core. But only
> > then. If it is something that only the Amazon provider and S3 need -
> > let's make it use Amazon **whatever** as the backing storage.
> >
> > I would even say - talk to the Google team and try to come up with an
> > abstraction that can be used for versioning in both S3 and GCS, agree on
> > it, and let's see if this abstraction should find its way to the core.
> > That would be my proposal.
> >
> > J.
> >
> >
> >
> >
> > On Wed, Jul 9, 2025 at 7:37 PM Oliveira, Niko <oniko...@amazon.com.invalid>
> > wrote:
> >
> >> Thanks for engaging folks!
> >>
> >> I don't love the idea of using another bucket. For one, this means Airflow
> >> needs write access to S3, which is not ideal; some users/customers are very
> >> sensitive about ever allowing write access to things. And two, you will
> >> commonly get issues with a design that leaks state into customer-managed
> >> accounts/resources: they may delete the bucket not knowing what it is, or
> >> they may not migrate it to a new account or region if they ever move. I
> >> think it's best for the data to be stored transparently to the user and
> >> co-located with the data it strongly relates to (i.e. the dag runs that are
> >> associated with those bundle versions).
> >>
> >> Is using DB Manager completely unacceptable these days? What are folks'
> >> thoughts on that?
> >>
> >> Cheers,
> >> Niko
> >>
> >> ________________________________
> >> From: Jarek Potiuk <ja...@potiuk.com>
> >> Sent: Wednesday, July 9, 2025 6:23:54 AM
> >> To: dev@airflow.apache.org
> >> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
> >>
> >>> Another option would also be using a DynamoDB table? That also supports
> >>> snapshots, and I feel it works very well with state management.
> >>
> >> Yep that would also work.
> >>
> >> Anything "Amazon" to keep state would do. I think that it should be our
> >> "default" approach that if we have to keep state and the state is
> connected
> >> with specific "provider's" implementation, it's best to not keep state
> in
> >> Airflow, but in the "integration" that the provider works with if
> possible.
> >> We cannot do it in "generic" case because we do not know what
> >> "integrations" the user has - but since this is "provider's"
> functionality,
> >> using anything else that the given integration provides makes perfect
> >> sense.
> >>
> >> J.
> >>
> >>
> >> On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu <
> >> gopidesupa...@gmail.com>
> >> wrote:
> >>
> >>> Agree, another s3 bucket also works here.
> >>>
> >>> Another option would also be using a DynamoDB table? That also supports
> >>> snapshots, and I feel it works very well with state management.
> >>>
> >>>
> >>> Pavan
> >>>
> >>> On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>>> One of the options would be to use a similar approach to the one
> >>>> terraform uses - i.e. use a dedicated "metadata" state storage in a
> >>>> DIFFERENT s3 bucket than the DAG files. Since we know there must be an
> >>>> S3 available (obviously), it seems not too excessive to assume that
> >>>> there might be another bucket, independent of the DAG bucket, where the
> >>>> state is stored - the same bucket (and a dedicated connection id) could
> >>>> even be used to store state for multiple S3 dag bundles - each dag
> >>>> bundle could have a dedicated object storing the state. The metadata is
> >>>> not huge, so continuously reading and replacing it should not be an
> >>>> issue.
> >>>>
> >>>> What's nice about it - this single object could even **actually** use
> >>>> S3 versioning to keep historical state - to optimize things, and
> >>>> potentially to keep a log of changes.
> >>>>
> >>>> J.
> >>>>
> >>>> On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid>
> >>>> wrote:
> >>>>
> >>>>> Hey folks,
> >>>>>
> >>>>> tl;dr I'd like to get some thoughts on a proposal to use the DB Manager
> >>>>> for S3 Dag Bundle versioning.
> >>>>>
> >>>>> The initial commit for S3 Dag Bundles was recently merged [1], but it
> >>>>> lacks Bundle versioning (since this isn't trivial with something like
> >>>>> S3, like it is with Git). The proposed solution involves building a
> >>>>> snapshot of the S3 bucket at the time each Bundle version is created:
> >>>>> noting the version of all the objects in the bucket (using S3's native
> >>>>> bucket versioning feature), creating a manifest to store those
> >>>>> versions, and then giving that whole manifest itself some unique
> >>>>> id/version/uuid. These manifests now need to be stored somewhere for
> >>>>> future use/retrieval. The proposal is to store them in the Airflow
> >>>>> database using the DB Manager feature. Other options include using the
> >>>>> local filesystem to store them (but this obviously won't work in
> >>>>> Airflow's distributed architecture) or the S3 bucket itself (but this
> >>>>> requires write access to the bucket, and we will always be at the
> >>>>> mercy of the user accidentally deleting/modifying the manifests as
> >>>>> they try to manage the lifecycle of their bucket; they should not need
> >>>>> to be aware of or account for this metadata). So the Airflow DB works
> >>>>> nicely as a persistent and internally accessible location for this
> >>>>> data.
> >>>>>
> >>>>> But I'm aware of the complexities of using the DB Manager and the
> >>>>> discussion we had during the last dev call about providers vending DB
> >>>>> tables (concerning migrations and ensuring smooth upgrades or
> >>>>> downgrades of the schema). So I wanted to reach out to see what folks
> >>>>> thought. I have talked to Jed, the Bundle Master (tm), and we haven't
> >>>>> come up with anything else that solves the problem as cleanly, so the
> >>>>> DB Manager is still my top choice. I think what we go with will pave
> >>>>> the way for other Bundle providers of a similar type as well, so it's
> >>>>> worth thinking deeply about this decision.
> >>>>>
> >>>>> Let me know what you think and thanks for your time!
> >>>>>
> >>>>> Cheers,
> >>>>> Niko
> >>>>>
> >>>>> [1] https://github.com/apache/airflow/pull/46621
> >>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>

Reply via email to