simi opened a new pull request, #52002:
URL: https://github.com/apache/airflow/pull/52002
### Problem
Currently, when working with `ObjectStoragePath`, you need to redundantly
specify both the `conn_id` and the full URI including the protocol (scheme),
even though the protocol can be inferred from the connection itself.
For example, consider the same `storage` connection defined in two environments:
```bash
# Local development
AIRFLOW_CONN_STORAGE='{
"conn_type": "objectstore",
"extra": {
"provider": "file",
"base_path": "/opt/airflow/storage"
}
}'
# Production
AIRFLOW_CONN_STORAGE='{
"conn_type": "objectstore",
"extra": {
"provider": "gcs",
"base_path": "rasa-as-test",
"project": "rasa-test-1",
"key_path": "/opt/airflow/config/keys/gcp.json"
}
}'
```
To construct an object path, you currently need to write:
```python
# Local
ObjectStoragePath("file://folder/file.txt", conn_id="storage")
# Production
ObjectStoragePath("gcs://bucket/file.txt", conn_id="storage")
```
Even though the `conn_id` is the same, the protocol part (`file://`,
`gcs://`) must be manually duplicated in the path string. This defeats the
purpose of having environment-specific connections and creates unnecessary
friction.
---
### Proposal
This PR introduces:
```python
ObjectStoragePath.from_conn(conn: Connection, path: str) -> ObjectStoragePath
```
This method allows you to construct a consistent object path across
environments by deferring to the connection metadata for `provider` and
`base_path`, which are used to build the full URI internally.
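
As a rough illustration of the idea (not the PR's actual implementation), the URI assembly could look like the sketch below. The helper name `build_uri` and the plain dict standing in for `conn.extra` are assumptions made for this example; only the `provider` and `base_path` keys come from the connection definitions above.

```python
def build_uri(extra: dict, path: str) -> str:
    """Hypothetical helper: join provider, base_path and a relative path into a URI."""
    provider = extra["provider"]            # becomes the URI scheme, e.g. "file" or "gcs"
    base_path = extra.get("base_path", "")  # bucket name or root directory
    # Normalize slashes so exactly one separator sits between the two parts.
    return f"{provider}://{base_path.rstrip('/')}/{path.lstrip('/')}"


# Plain dicts mirroring the "extra" section of the two connections above.
local = {"provider": "file", "base_path": "/opt/airflow/storage"}
prod = {"provider": "gcs", "base_path": "rasa-as-test"}

print(build_uri(local, "folder/file.txt"))  # file:///opt/airflow/storage/folder/file.txt
print(build_uri(prod, "folder/file.txt"))   # gcs://rasa-as-test/folder/file.txt
```

The same relative path yields a different full URI per environment, which is the behavior `from_conn` is meant to provide.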
---
### Example
With this change, you can now write:
```python
ObjectStoragePath.from_conn(conn, "folder/file.txt")
```
...and the correct protocol and bucket/path will be resolved from the
`conn.extra` fields, like `provider` and `base_path`.
---
### Considerations
* Yes, the connection is resolved twice: once manually before passing it to
`from_conn`, and again later during internal resolution via `attach` and the
I/O system. This isn't ideal, but it was the simplest way to ensure
`ObjectStoragePath` has fully defined properties (`protocol`, etc.) at
construction time. As far as I understand, the lookup should be cached, so this
may not actually be a problem.
### Alternatives
I initially tried to implement it as `ObjectStoragePath("file.txt",
conn_id="storage")`, hoping the protocol would be inferred from the connection
early on. But this doesn't work: attributes like `protocol` remain
empty until `.fs` is accessed later (I'm not 100% sure that is the method
responsible), which fills them in lazily.
This delayed resolution makes the object unpredictable. The `from_conn`
approach ensures all parts of the URI are set upfront using the connection
details, so the path is complete and deterministic from the start — no hidden
side effects or missing values.
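
The difference between the two approaches can be shown with a toy model. The classes `LazyPath` and `EagerPath` and the `CONNECTIONS` dict below are hypothetical stand-ins written for this description; they only mimic the behavior described above and are not Airflow code.

```python
from dataclasses import dataclass

# Stand-in for a connection store; a real lookup would go through Airflow connections.
CONNECTIONS = {"storage": {"provider": "file", "base_path": "/opt/airflow/storage"}}


class LazyPath:
    """Mimics the rejected approach: protocol is unknown until .fs is touched."""

    def __init__(self, path: str, conn_id: str):
        self.path = path
        self.conn_id = conn_id
        self.protocol = ""  # empty at construction time

    @property
    def fs(self) -> str:
        # Side effect: accessing .fs resolves the connection and fills in protocol.
        self.protocol = CONNECTIONS[self.conn_id]["provider"]
        return f"<filesystem for {self.protocol}>"


@dataclass(frozen=True)
class EagerPath:
    """Mimics the from_conn idea: every field is known when the object is created."""

    protocol: str
    path: str

    @classmethod
    def from_conn(cls, extra: dict, path: str) -> "EagerPath":
        return cls(protocol=extra["provider"], path=f"{extra['base_path']}/{path}")


lazy = LazyPath("file.txt", conn_id="storage")
print(repr(lazy.protocol))  # '' because .fs has not been accessed yet

eager = EagerPath.from_conn(CONNECTIONS["storage"], "file.txt")
print(eager.protocol)  # file
```

In the lazy variant, code that reads `protocol` gets a different answer depending on whether something else touched `.fs` first; the eager variant has no such ordering dependency.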
---
### Feedback Welcome
I'm new to the Airflow codebase and open to better designs if there's a
cleaner way to avoid the double resolution while still ensuring proper
initialization. I intentionally kept the change minimal so that it fits cleanly
into the current behavior of `ObjectStoragePath` and `airflow.io.*`.