simi commented on PR #52002:
URL: https://github.com/apache/airflow/pull/52002#issuecomment-3095546058

   @potiuk thanks for the review. I do agree on almost everything written. 
Currently I do host this as "utility" shared method (shared at the end of 
comment).
   
   > I don't think the pattern you explain is "generic" that "everyone" is 
using . It's very specific for your case, where you want to use different 
provider locally and for production. This is somethign your company chose to 
do, but this is not something that is very generic pattern.
   
   I'm trying to disconnect Airflow environment from DAGs. When 
`ObjectStoragePath` is acquired (and properly configured), this abstraction is 
already there and there is (almost) no difference if configured for S3 or AWS.
   
   I understand current approach is super brittle and provides almost no 
guarantees. Indeed **Connection** is too wide abstraction and doesn't currently 
couple with `ObjectStoragePath`. I have seen similar brittle coupling in other 
Airflow parts and considered that would be maybe ok in here also.
   
   Would it make sense to couple **Connection** and **Object Storage** somehow 
making more simple to check if given connection can initiate **Object Storage** 
path?
   
   Is there any best practice on configuring Object Storage initialization? 
Maybe we can just improve 
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html
 with best patterns. I'm trying to avoid unnecessary branching as in this 
pseudo-code.
   
   ```
   path = 'my/path'
   ...
   if provider is aws
     base = ObjectStoragePath(f"s3://{path}", conn_id="connection")
   else if provider is gcs
       base = ObjectStoragePath(f"gcs://{path}", conn_id="connection")
   ```
   
   Would it make more sense for example try to extract "protocol" part for 
`fsspec` out of connection instead like 
`BaseHook.get_connection("storage").fsspec_protocol` and let user compose 
`ObjectStoragePath` that way?
   
   :thinking: Per my experience environments differs mostly by "bucket" on 
ObjectStoragePath (and in some of my cases also by protocol). IMHO it is worth 
to extract those two out of specified connection.
   
   I was thinking also about to store base path (including protocol) in 
Variable and the rest in Connection. But this will couple Variable and 
Connection together I will need to ensure is kept in sync on all environments 
(aka no guarantees). 
   
   ---
   
   ```python
   from airflow.sdk import ObjectStoragePath
   
   
   # TODO: check https://github.com/apache/airflow/pull/52002
   def path_from_conn(conn) -> "ObjectStoragePath":
       extra = conn.extra_dejson or {}
   
       provider = extra.get("provider")
       base_path = extra.get("base_path")
   
       if not provider or not base_path:
           raise ValueError(
               f"Missing 'provider' or 'base_path' in connection extras for 
conn_id={conn.conn_id}"
           )
   
       if provider not in fsspec.available_protocols():
           raise ValueError(
               f"Unknown provider '{provider}' for conn_id={conn.conn_id}. "
               f"Known providers: {fsspec.available_protocols()}"
           )
   
       uri = urlunsplit((provider, base_path, "", "", ""))
       return ObjectStoragePath(uri, conn_id=conn.conn_id)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to