I may be joining this conversation late, but I would like to
share/document a pattern which I hope can provide more reference points
for the discussion.

In terms of deployment model, there are isolated environments, each with
its own Airflow deployment and the required services.

DEV Environment -> DEV Airflow + other DEV infra
PROD Environment -> PROD Airflow + other PROD infra

Secrets and configuration are usually already accessible from somewhere
(e.g. environment variables, a vault service) specific to each
environment. The code is essentially the same everywhere; it just fetches
configuration data or secrets from the corresponding environment. For
example,

DEV Environment -> “S3 bucket” -> s3://dev-bucket
PROD Environment -> “S3 bucket” -> s3://prod-bucket

Under this deployment model, environment-specific configuration does not
need to be set at the operator level.
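
A minimal sketch of the pattern (the variable name DATA_BUCKET and the
use of an environment variable are illustrative assumptions, not a
standard):

```python
import os

from airflow.io.path import ObjectStoragePath

# Identical DAG code runs in every environment; only the environment-level
# configuration differs, e.g. DATA_BUCKET=s3://dev-bucket in DEV and
# DATA_BUCKET=s3://prod-bucket in PROD. DATA_BUCKET is a hypothetical name.
bucket = os.environ["DATA_BUCKET"]

path = ObjectStoragePath(bucket) / "my_file.txt"
```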

Hope this helps,
Kevin Yang

________________________________
From: Josef Šimánek <josef.sima...@gmail.com>
Sent: Monday, August 4, 2025 6:57:08 AM
To: dev@airflow.apache.org <dev@airflow.apache.org>
Cc: Bolke de Bruin <bo...@apache.org>
Subject: Re: [DISCUSS] Best practices for initializing `ObjectStoragePath`

On Mon, 4 Aug 2025 at 12:42, Bolke de Bruin <bo...@apache.org> wrote:
>
>  Josef is proposing to make ObjectStoragePath construction
> environment-agnostic by storing provider and base path
> in Airflow connections. So you just need to change the connection
> configuration.

I'm trying to make it provider-agnostic, not environment-agnostic. IMHO
it is not possible to construct an ObjectStoragePath from a connection
alone without specifying the protocol (gcs, s3, file...). I need to store
at least two parts in ENV (see the sketch after this list):

- the connection itself (e.g. a GCS connection with key paths)
- the protocol + bucket/base path (e.g. "gcs://my-test" in an Airflow
variable)
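
A minimal sketch of that two-part setup as it looks in DAG code today
(the variable name and conn_id are hypothetical):

```python
from airflow.io.path import ObjectStoragePath
from airflow.models import Variable

# Part 1: the connection (e.g. a GCS connection with key paths),
# referenced by its conn_id.
# Part 2: the protocol + bucket/base path (e.g. "gcs://my-test"), kept in
# an Airflow Variable because it differs per environment.
base = Variable.get("storage_base_path")  # hypothetical variable name

root = ObjectStoragePath(base, conn_id="gcs_storage")  # hypothetical conn_id
target = root / "my_file.txt"
```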

Buckets usually differ per environment (since GCS bucket names are unique
across the whole platform). In some environments it can also be handy to
use a different connection/protocol; for example, it is often friendlier
to use file locally (for development/debugging purposes) and s3 in a
deployed environment.

I would prefer to drive the whole setup for ObjectStoragePath
construction through one configuration value (possible to pass in an
environment variable).
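
Until something like that exists, a small helper can approximate it. This
is only a sketch, assuming the connection's extras carry "protocol" and
"base_path" (both field names are my own, not an existing convention):

```python
from airflow.hooks.base import BaseHook
from airflow.io.path import ObjectStoragePath


def storage_root(conn_id: str) -> ObjectStoragePath:
    # Assumed extras: "protocol" (gcs, s3, file, ...) and "base_path"
    # (e.g. "my-test"); both field names are hypothetical.
    extra = BaseHook.get_connection(conn_id).extra_dejson
    return ObjectStoragePath(
        f"{extra['protocol']}://{extra['base_path']}", conn_id=conn_id
    )


# The same DAG code then works everywhere; only the connection definition
# changes between environments.
target = storage_root("storage") / "my_file.txt"
```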

> This indeed makes for tighter coupling and makes me wonder about the
> deployment model. While creating ObjectStoragePath I
> had standard CI/CD practices in mind, where these things typically get set
> through environment variables. This obviously
> assumes a deployment-wide setting and is not runtime-selectable.
> So to understand the case better, I'd like to
> know more about the need for runtime selection.

I set up connections and variables through environment variables and
secrets in a K8s deployment to two environments (test, production).
Locally I'm using docker-compose.yml with an .env file for development.
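
For reference, Airflow resolves connections and variables from
environment variables named AIRFLOW_CONN_<CONN_ID> and AIRFLOW_VAR_<NAME>,
so a .env file along these lines is enough (the values are made up; the
JSON connection format requires Airflow 2.3+):

```
AIRFLOW_CONN_STORAGE='{"conn_type": "google_cloud_platform", "extra": {"key_path": "/secrets/key.json"}}'
AIRFLOW_VAR_STORAGE_BASE_PATH=gcs://my-test
```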

> Care to clarify?
>
> Cheers
> Bolke
>
>
>
>
> On Sun, 3 Aug 2025 at 22:28, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Bolke (or others) - maybe you can add something here and (re) ignite the
> > discussion ?
> >
> > On Tue, Jul 22, 2025 at 8:40 PM Josef Šimánek <josef.sima...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I've been experimenting with `ObjectStoragePath` and recently opened a
> > > [PR](https://github.com/apache/airflow/pull/52002) aiming to simplify
> > > its construction using Airflow connections — especially in cases where
> > > environments (e.g., dev, staging, prod) differ primarily in object
> > > storage provider (e.g., S3, GCS, file) and base path.
> > >
> > > The goal was to construct a reusable root path from a connection like this:
> > >
> > > ```python
> > > storage = ObjectStoragePath.from_conn(BaseHook.get_connection("storage"))
> > > object = storage / "my_file.txt"
> > > ```
> > >
> > > ...without needing to hardcode schemes like `s3://` or `gs://` and
> > > base paths (usually "buckets") into the DAG code. The idea was to
> > > infer provider and base path from connection `extra` fields (e.g.,
> > > `provider`, `base_path`), allowing the same DAG code to work across
> > > environments by simply reconfiguring the connection.
> > >
> > > The PR sparked a great discussion (linked above), and I realized this
> > > might be a good opportunity to collect **broader community
> > > experience** around the use of `ObjectStoragePath` and object storage
> > > in general.
> > >
> > > A few questions I'd like to raise:
> > >
> > > * How are you configuring access to object storage across environments?
> > > * Do you find it useful to extract `scheme` and `base_path` from
> > > connections (or any other configuration)?
> > > * Are there existing best practices or patterns for making
> > > `ObjectStoragePath` construction generic and environment-agnostic?
> > > * Would it make sense to define a common utility or convention (e.g.
> > > via extras, `get_fs`, provider's `filesystems`, or a connection
> > > helper)?
> > >
> > > I’m primarily looking for the best pattern—if any exists—or hoping we
> > > can come together to define and document one as a community.
> > >
> > > Best regards,
> > > Josef Šimánek (https://github.com/simi)
> > >
> > >
> > >
> >
