simi commented on PR #52002: URL: https://github.com/apache/airflow/pull/52002#issuecomment-3095546058
@potiuk thanks for the review. I do agree on almost everything written. Currently I do host this as "utility" shared method (shared at the end of comment). > I don't think the pattern you explain is "generic" that "everyone" is using . It's very specific for your case, where you want to use different provider locally and for production. This is somethign your company chose to do, but this is not something that is very generic pattern. I'm trying to disconnect Airflow environment from DAGs. When `ObjectStoragePath` is acquired (and properly configured), this abstraction is already there and there is (almost) no difference if configured for S3 or AWS. I understand current approach is super brittle and provides almost no guarantees. Indeed **Connection** is too wide abstraction and doesn't currently couple with `ObjectStoragePath`. I have seen similar brittle coupling in other Airflow parts and considered that would be maybe ok in here also. Would it make sense to couple **Connection** and **Object Storage** somehow making more simple to check if given connection can initiate **Object Storage** path? Is there any best practice on configuring Object Storage initialization? Maybe we can just improve https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/objectstorage.html with best patterns. I'm trying to avoid unnecessary branching as in this pseudo-code. ``` path = 'my/path' ... if provider is aws base = ObjectStoragePath(f"s3://{path}", conn_id="connection") else if provider is gcs base = ObjectStoragePath(f"gcs://{path}", conn_id="connection") ``` Would it make more sense for example try to extract "protocol" part for `fsspec` out of connection instead like `BaseHook.get_connection("storage").fsspec_protocol` and let user compose `ObjectStoragePath` that way? :thinking: Per my experience environments differs mostly by "bucket" on ObjectStoragePath (and in some of my cases also by protocol). IMHO it is worth to extract those two out of specified connection. I was thinking also about to store base path (including protocol) in Variable and the rest in Connection. But this will couple Variable and Connection together I will need to ensure is kept in sync on all environments (aka no guarantees). --- ```python from airflow.sdk import ObjectStoragePath # TODO: check https://github.com/apache/airflow/pull/52002 def path_from_conn(conn) -> "ObjectStoragePath": extra = conn.extra_dejson or {} provider = extra.get("provider") base_path = extra.get("base_path") if not provider or not base_path: raise ValueError( f"Missing 'provider' or 'base_path' in connection extras for conn_id={conn.conn_id}" ) if provider not in fsspec.available_protocols(): raise ValueError( f"Unknown provider '{provider}' for conn_id={conn.conn_id}. " f"Known providers: {fsspec.available_protocols()}" ) uri = urlunsplit((provider, base_path, "", "", "")) return ObjectStoragePath(uri, conn_id=conn.conn_id) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
