dstandish commented on code in PR #25972:
URL: https://github.com/apache/airflow/pull/25972#discussion_r955515944


##########
airflow/models/dataset.py:
##########
@@ -100,6 +98,55 @@ def __hash__(self):
     def __repr__(self):
         return f"{self.__class__.__name__}(uri={self.uri!r}, 
extra={self.extra!r})"
 
+    @property
+    def canonical_uri(self):
+        """
+        Resolve the canonical uri for a dataset.
+
+        If the uri doesn't have an `airflow` scheme, return it as-is.

Review Comment:
   just want to point out, this will leave a bit of a gap concerning 
canonicalization... given that for some services there is more than one way to 
represent a given resource.  not to mention the ambiguity in the spec itself.
   
   what to do about it?  well, we could lean on the "experimental" aspect of 
this and change it later if we decide to implement fuller canonicalization.  
alternatively, we could be a little more restrictive, and instead of adding 
this property (which advertises canonicalization) we could add a method like 
"render_airflow_uri" or something...



##########
airflow/models/dataset.py:
##########
@@ -100,6 +98,55 @@ def __hash__(self):
     def __repr__(self):
         return f"{self.__class__.__name__}(uri={self.uri!r}, 
extra={self.extra!r})"
 
+    @property
+    def canonical_uri(self):
+        """
+        Resolve the canonical uri for a dataset.
+
+        If the uri doesn't have an `airflow` scheme, return it as-is.
+
+        If it does have an `airflow` scheme, it takes the connection id from
+        the netloc. It then will combine the connection uri and dataset uri to
+        form the canonical uri. It does this by:
+
+        * Using the scheme from the connection, unless an override is provided
+          in the dataset scheme (e.g. airflow+override://)
+        * Combine the path, connection first followed by the dataset path
+        * Merge the query args
+
+        # airflow://conn_id/...
+        # airflow+override://conn_id/...
+        # airflow://conn_id/some_extra_path?query
+        """
+        parsed = urlparse(self.uri)
+
+        if not parsed.scheme.startswith("airflow"):
+            return self.uri
+
+        conn_id = parsed.hostname
+        conn = 
urlparse(Connection.get_connection_from_secrets(conn_id).get_uri())

Review Comment:
   i'm not sure this is a good idea.  there could be sensitive information in 
here.  a lot of it tends to deal with auth and config more so than the mere 
location of the data thing.  e.g. keyfile_dict?  it could be very long.  also 
you have problems with "canonicalizing" i.e. for a given airflow connection and 
resource there could be multiple ways to represent it e.g. password in user 
info block vs in extra and this kind of thing.  also, extra is nested json it 
will render everything under query param `__extra__` or something like that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to