Agreed, another S3 bucket would also work here.
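
To illustrate the manifest idea from Niko's proposal below: the manifest could be a small JSON document mapping object keys to S3 version ids, plus its own uuid. This is just a sketch of the shape, not actual implementation code; the names and example version ids are made up:

```python
import json
import uuid

def build_manifest(object_versions):
    """Build a bundle-version manifest from a listing of the current S3
    object versions (e.g. gathered via S3's list_object_versions API,
    keeping only entries marked IsLatest). The manifest gets its own
    uuid so a Bundle version can refer to it later."""
    return {
        "manifest_id": str(uuid.uuid4()),
        "objects": dict(object_versions),
    }

# Snapshot of a hypothetical two-object DAG bucket.
manifest = build_manifest([
    ("dags/etl.py", "3HL4kqtJ.example.version.id"),
    ("dags/report.py", "null"),  # S3 reports "null" for pre-versioning objects
])
payload = json.dumps(manifest)  # what would be persisted for later retrieval
```

Whichever backend stores `payload` (Airflow DB, metadata bucket, DynamoDB), resolving a Bundle version is then just fetching the manifest by id and doing versioned GETs per object.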

Another option would be a DynamoDB table: it also supports snapshots, and
I feel it works very well for state management.
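
As a sketch of how that could look: a state table keyed by bundle name could use DynamoDB's conditional writes for optimistic locking, much like terraform's locking. The table and attribute names here are hypothetical, not a concrete proposal:

```python
def build_state_put(bundle_name, manifest_json, expected_version):
    """Build the arguments for a DynamoDB PutItem that stores bundle state
    with optimistic locking: the write succeeds only if the stored record
    is still at `expected_version` (or absent when expected_version is 0).
    Table and attribute names are made up for illustration."""
    request = {
        "TableName": "airflow_bundle_state",
        "Item": {
            "bundle_name": {"S": bundle_name},
            "state_version": {"N": str(expected_version + 1)},
            "manifest": {"S": manifest_json},
        },
    }
    if expected_version == 0:
        # First write: only succeed if no record exists yet.
        request["ConditionExpression"] = "attribute_not_exists(bundle_name)"
    else:
        # Subsequent writes: only succeed if nobody else bumped the version.
        request["ConditionExpression"] = "state_version = :v"
        request["ExpressionAttributeValues"] = {":v": {"N": str(expected_version)}}
    return request
```

A boto3 client would then apply it with `client.put_item(**build_state_put(...))` and retry after re-reading on `ConditionalCheckFailedException`.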


Pavan

On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> One of the options would be to use a similar approach to the one terraform
> uses - i.e. a dedicated "metadata" state storage in a DIFFERENT s3 bucket than
> DAG files. Since we know there must be an S3 available (obviously) - it
> seems not too excessive to assume that there might be another bucket,
> independent of the DAG bucket where the state is stored - same bucket (and
> dedicated connection id) could even be used to store state for multiple S3
> dag bundles - each Dag bundle could have a dedicated object storing the
> state. The metadata is not huge, so continuously reading and replacing it
> should not be an issue.
>
>  What's nice about it: this single object could even **actually** use S3
> versioning to keep historical state, to optimize things and potentially
> keep a log of changes.
>
> J.
>
> On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid>
> wrote:
>
> > Hey folks,
> >
> > tl;dr I’d like to get some thoughts on a proposal to use DB Manager
> > for S3 Dag Bundle versioning.
> >
> > The initial commit for S3 Dag Bundles was recently merged [1], but it
> > lacks Bundle versioning (since this isn’t trivial with S3, as it is
> > with Git). The proposed solution involves building a snapshot of the
> > S3 bucket at the time each Bundle version is created: noting the
> > version of all the objects in the bucket (using S3’s native bucket
> > versioning feature), creating a manifest to store those versions, and
> > then giving that whole manifest itself some unique id/version/uuid.
> > These manifests now need to be stored somewhere for future
> > use/retrieval. The proposal is to store them in the Airflow database
> > using the DB Manager feature. Other options include the local
> > filesystem (but this obviously won’t work in Airflow’s distributed
> > architecture) or the S3 bucket itself (but this requires write access
> > to the bucket, and we will always be at the mercy of the user
> > accidentally deleting/modifying the manifests as they manage the
> > lifecycle of their bucket; they should not need to be aware of, or
> > account for, this metadata). So the Airflow DB works nicely as a
> > persistent and internally accessible location for this data.
> >
> > But I’m aware of the complexities of using the DB Manager and the
> > discussion we had during the last dev call about providers vending DB
> > tables (concerning migrations and ensuring smooth upgrades or
> > downgrades of the schema). So I wanted to reach out to see what folks
> > thought. I have talked to Jed, the Bundle Master (tm), and we haven’t
> > come up with anything else that solves the problem as cleanly, so the
> > DB Manager is still my top choice. I think what we go with will pave
> > the way for other Bundle providers of a similar type as well, so it's
> > worth thinking deeply about this decision.
> >
> > Let me know what you think and thanks for your time!
> >
> > Cheers,
> > Niko
> >
> > [1] https://github.com/apache/airflow/pull/46621
> >
>
