One option would be an approach similar to what Terraform uses for remote
state - i.e. keep dedicated "metadata" state storage in a DIFFERENT S3
bucket than the DAG files. Since we know S3 must be available anyway, it
does not seem excessive to assume there could be another bucket,
independent of the DAG bucket, where the state is stored. The same bucket
(with a dedicated connection id) could even store state for multiple S3
dag bundles - each Dag bundle would have a dedicated object holding its
state. The metadata is not huge, so continuously reading and replacing it
should not be an issue.

What's nice about this: that single object could even **actually** use S3
versioning to keep historical state - potentially optimizing things and
keeping a log of changes.
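To make the idea concrete, here is a rough sketch of the read/replace flow
for a per-bundle state object. Everything here is hypothetical - the bucket
name, key layout, and helper functions are assumptions for illustration,
not an existing Airflow API - and a tiny in-memory stand-in replaces the
real boto3 client so the flow can be exercised without AWS:

```python
import json

# Assumed dedicated state bucket, separate from the DAG bucket.
STATE_BUCKET = "airflow-bundle-state"


def state_key(bundle_name: str) -> str:
    # One object per bundle, e.g. s3://airflow-bundle-state/bundles/my-bundle.json
    return f"bundles/{bundle_name}.json"


def read_state(s3, bundle_name: str) -> dict:
    # With a real boto3 client this would be
    # s3.get_object(...)["Body"].read(), and a missing key raises a
    # botocore ClientError rather than KeyError; the stub below keeps it simple.
    try:
        body = s3.get_object(Bucket=STATE_BUCKET, Key=state_key(bundle_name))
    except KeyError:
        return {}  # no state stored yet for this bundle
    return json.loads(body)


def replace_state(s3, bundle_name: str, state: dict) -> None:
    # Overwrites the single state object wholesale; with S3 bucket
    # versioning enabled, every overwrite keeps the previous state
    # around as an older object version (the "historical state" above).
    s3.put_object(
        Bucket=STATE_BUCKET,
        Key=state_key(bundle_name),
        Body=json.dumps(state, sort_keys=True),
    )


# Minimal in-memory stand-in for the two S3 calls used above.
class FakeS3:
    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return self.objects[(Bucket, Key)]


s3 = FakeS3()
assert read_state(s3, "my-bundle") == {}
replace_state(s3, "my-bundle", {"manifest_version": "abc123"})
assert read_state(s3, "my-bundle") == {"manifest_version": "abc123"}
```

Since the whole state fits in one small object, a plain read-modify-replace
is enough; concurrent writers would need something extra (e.g. conditional
writes), but for a scheduler-owned object that should rarely matter.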

J.

On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid>
wrote:

> Hey folks,
>
> tl;dr I’d like to get some thoughts on a proposal to use DB Manager for S3
> Dag Bundle versioning.
>
> The initial commit for S3 Dag Bundles was recently merged [1] but it lacks
> Bundle versioning (since this isn’t trivial with something like S3, like it
> is with Git). The proposed solution involves building a snapshot of the S3
> bucket at the time each Bundle version is created: noting the version of
> all the objects in the bucket (using S3’s native bucket versioning
> feature), creating a manifest to store those versions, and then giving
> that manifest itself a unique id/version/uuid. These manifests now need to be
> stored somewhere for future use/retrieval. The proposal is to use the
> Airflow database using the DB Manager feature. Other options include using
> the local filesystem to store them (but this obviously won’t work in
> Airflow’s distributed architecture) or the S3 bucket itself (but this
> requires write access to the bucket, and we will always be at the mercy of
> the user accidentally deleting/modifying the manifests as they try to
> manage the lifecycle of their bucket; they should not need to be aware of
> or need to account for this metadata). So the Airflow DB works nicely as a
> persistent and internally accessible location for this data.
>
> But I’m aware of the complexities of using the DB Manager and the
> discussion we had during the last dev call about providers vending DB
> tables (concerning migrations and ensuring smooth upgrades or downgrades of
> the schema). So I wanted to reach out to see what folks thought. I have
> talked to Jed, the Bundle Master (tm), and we haven’t come up with anything
> else that solves the problem as cleanly, so the DB Manager is still my top
> choice. I think what we go with will pave the way for other Bundle
> providers of a similar type as well, so it's worth thinking deeply about
> this decision.
>
> Let me know what you think and thanks for your time!
>
> Cheers,
> Niko
>
> [1] https://github.com/apache/airflow/pull/46621
>
