simi opened a new pull request, #52002:
URL: https://github.com/apache/airflow/pull/52002

   ### Problem
   
   Currently, when working with `ObjectStoragePath`, you need to redundantly 
specify both the `conn_id` and the full URI including the protocol (scheme), 
even though the protocol can be inferred from the connection itself.
   
   For example, in two different environments:
   
   ```bash
   # Local development
   AIRFLOW_CONN_STORAGE='{
     "conn_type": "objectstore",
     "extra": {
       "provider": "file",
       "base_path": "/opt/airflow/storage"
     }
   }'
   
   # Production
   AIRFLOW_CONN_STORAGE='{
     "conn_type": "objectstore",
     "extra": {
       "provider": "gcs",
       "base_path": "rasa-as-test",
       "project": "rasa-test-1",
       "key_path": "/opt/airflow/config/keys/gcp.json"
     }
   }'
   ```
   
   To construct an object path, you currently need to write:
   
   ```python
   # Local
   ObjectStoragePath("file://folder/file.txt", conn_id="storage")
   
   # Production
   ObjectStoragePath("gcs://bucket/file.txt", conn_id="storage")
   ```
   
   Even though the `conn_id` is the same, the protocol part (`file://`, 
`gcs://`) must be manually duplicated in the path string. This defeats the 
purpose of having environment-specific connections and creates unnecessary 
friction.
   
   ---
   
   ### Proposal
   
   This PR introduces:
   
   ```python
   ObjectStoragePath.from_conn(conn: Connection, path: str) -> ObjectStoragePath
   ```
   
   This method allows you to construct a consistent object path across 
environments by deferring to the connection metadata for `provider` and 
`base_path`, which are used to build the full URI internally.
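   
   As a rough illustration of that idea (not the code in this PR), the sketch 
below joins `provider`, `base_path`, and a relative path into a full URI. The 
helper name `build_uri` and the plain dicts standing in for `conn.extra` are 
assumptions made for this example only.

```python
# Hypothetical sketch: build a full URI from connection extras.
# `build_uri` is an illustrative name, not an Airflow API.

def build_uri(extra: dict, path: str) -> str:
    """Join provider, base_path, and a relative path into one URI string."""
    provider = extra["provider"]
    # Drop a trailing slash on the base and a leading slash on the path
    # so the two parts are joined by exactly one "/".
    base = extra.get("base_path", "").rstrip("/")
    rel = path.lstrip("/")
    return f"{provider}://{base}/{rel}"


# Plain dicts standing in for the `extra` field of the two connections above.
LOCAL_EXTRA = {"provider": "file", "base_path": "/opt/airflow/storage"}
PROD_EXTRA = {"provider": "gcs", "base_path": "rasa-as-test"}

local_uri = build_uri(LOCAL_EXTRA, "folder/file.txt")
prod_uri = build_uri(PROD_EXTRA, "folder/file.txt")

print(local_uri)  # file:///opt/airflow/storage/folder/file.txt
print(prod_uri)   # gcs://rasa-as-test/folder/file.txt
```

   Under these assumptions, the same relative path maps to a local file in 
development and to a bucket object in production, with no scheme in the caller's 
code.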
   
   ---
   
   ### Example
   
   With this change, you can now write:
   
   ```python
   ObjectStoragePath.from_conn(conn, "folder/file.txt")
   ```
   
   ...and the correct protocol and bucket/path will be resolved from the 
`conn.extra` fields, like `provider` and `base_path`.
   
   ---
   
   ### Considerations
   
   * The connection is resolved twice: once manually before it is passed to 
`from_conn`, and again later during internal resolution via `attach` and the 
I/O system. This isn't ideal, but it was the simplest way to ensure that 
`ObjectStoragePath` has fully defined properties (`protocol`, etc.) at 
construction time. As far as I understand, the lookup is cached, so the second 
resolution may not be a real problem.
   
   ### Alternatives
   
   I initially tried to implement this as `ObjectStoragePath("file.txt", 
conn_id="storage")`, hoping the protocol would be inferred from the connection 
early on. But this doesn't work: attributes like `protocol` remain empty until 
`.fs` (I'm not 100% sure which method triggers it) is accessed later, which 
fills them in lazily.
   
   This delayed resolution makes the object unpredictable. The `from_conn` 
approach ensures all parts of the URI are set upfront using the connection 
details, so the path is complete and deterministic from the start — no hidden 
side effects or missing values.
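   
   The difference between the two approaches can be sketched with a toy model. 
This is not Airflow's implementation: `LazyPath`, `EagerPath`, and the dict 
standing in for connection extras are hypothetical names, and `.fs` is used as 
the trigger only because it is the accessor mentioned above.

```python
# Toy model of lazy versus eager protocol resolution (not Airflow code).

class LazyPath:
    """Protocol stays empty until `.fs` is accessed for the first time."""

    def __init__(self, path: str, conn_extra: dict):
        self.path = path
        self._conn_extra = conn_extra
        self.protocol = ""  # unknown at construction time

    @property
    def fs(self) -> str:
        # Side effect: reading `.fs` fills in the protocol.
        self.protocol = self._conn_extra["provider"]
        return f"<filesystem for {self.protocol}>"


class EagerPath:
    """Protocol is taken from the connection details at construction time."""

    def __init__(self, path: str, conn_extra: dict):
        self.path = path
        self.protocol = conn_extra["provider"]


extra = {"provider": "gcs", "base_path": "rasa-as-test"}

lazy = LazyPath("file.txt", extra)
before = lazy.protocol   # "" because nothing has touched `.fs` yet
_ = lazy.fs
after = lazy.protocol    # "gcs" after the first `.fs` access

eager = EagerPath("file.txt", extra)
print(repr(before), repr(after), repr(eager.protocol))
```

   In the lazy variant the value of `protocol` depends on whether `.fs` has 
already been touched; in the eager variant it is fixed as soon as the object 
exists.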
   
   ---
   
   ### Feedback Welcome
   
   I'm new to the Airflow codebase and open to better designs if there's a 
cleaner way to avoid the double resolution while still ensuring proper 
initialization. I intentionally kept the change minimal so that it fits cleanly 
into the current behavior of `ObjectStoragePath` and the `airflow.io.*` modules.

