stegololz commented on PR #68022:
URL: https://github.com/apache/airflow/pull/68022#issuecomment-4698366391

   Update after digging into normalization.
   
   Dropping the host check alone is not enough: the SDK normalizes asset URIs 
with `urllib.parse.urlunsplit`, which only keeps the `//` authority for schemes 
in `uses_netloc`. `hdfs` is not in that list, so `hdfs:///apps/x` was being 
silently rewritten to `hdfs:/apps/x`. Since asset URIs are primary keys and 
feed OpenLineage, that silent rewrite is not acceptable.
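
   For reference, a minimal stdlib-only sketch of the rewrite. The `roundtrip` helper is illustrative and stands in for the SDK's normalization, which does more than this; it assumes an interpreter where nothing has already added `hdfs` to `uses_netloc`.

```python
from urllib.parse import urlsplit, urlunsplit, uses_netloc


def roundtrip(uri: str) -> str:
    """Split a URI and reassemble it (illustrative helper, not SDK code)."""
    return urlunsplit(urlsplit(uri))


# `file` is listed in uses_netloc, `hdfs` is not (on an unmodified stdlib).
file_listed = "file" in uses_netloc
hdfs_listed = "hdfs" in uses_netloc

# Both inputs have an empty authority: scheme + "//" + "" + path.
file_result = roundtrip("file:///apps/x")
hdfs_result = roundtrip("hdfs:///apps/x")

print(file_listed, file_result)
print(hdfs_listed, hdfs_result)
```

   The empty `//` authority survives for `file` and is dropped for `hdfs`, which is the silent rewrite described above.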
   
   Current branch:
   
   Register `hdfs` in `uses_netloc` so `hdfs:///path` round-trips intact, like 
`file` already does.
   `hdfs://namenode:8020/path` is unaffected.
   
   On the current branch, `hdfs:/path` canonicalizes to `hdfs:///path`. Is this 
acceptable, given that Hadoop treats them differently at the URI level 
(`hdfs:/path` and `hdfs:///path` *could* be distinct assets)?

