stegololz commented on PR #68022: URL: https://github.com/apache/airflow/pull/68022#issuecomment-4698366391
Update after digging into normalization.

Dropping the host check alone is not enough: the SDK normalizes asset URIs with `urllib.parse.urlunsplit`, which only keeps the `//` authority marker for an empty authority when the scheme is listed in `uses_netloc`. `hdfs` is not in that list, so `hdfs:///apps/x` was being silently rewritten to `hdfs:/apps/x`. Since asset URIs are primary keys and feed OpenLineage, that silent rewrite is not acceptable.

Current branch: register `hdfs` in `uses_netloc` so `hdfs:///path` round-trips intact, like `file` already does. `hdfs://namenode:8020/path` is unaffected. With this change, `hdfs:/path` canonicalizes to `hdfs:///path`.

Is this solution acceptable, knowing that Hadoop treats them differently at the URI level (`hdfs:/path` and `hdfs:///path` *could* be distinct assets)?
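For reviewers who want to reproduce the rewrite locally, here is a minimal sketch using only the standard library. It assumes the normalization reduces to `urlunsplit(urlsplit(uri))`; the SDK's real normalization does more than this, and the helper names `round_trip` and `round_trip_with_netloc_scheme` are hypothetical, not part of the SDK or of this PR.

```python
from urllib.parse import urlsplit, urlunsplit, uses_netloc


def round_trip(uri: str) -> str:
    """Split and reassemble a URI, as a stand-in for the SDK normalization."""
    return urlunsplit(urlsplit(uri))


def round_trip_with_netloc_scheme(uri: str, scheme: str = "hdfs") -> str:
    """Round-trip `uri` while `scheme` is temporarily listed in uses_netloc.

    The registration is undone afterwards so this sketch does not leak
    global state; the branch itself would register the scheme permanently.
    """
    added = scheme not in uses_netloc
    if added:
        uses_netloc.append(scheme)
    try:
        return round_trip(uri)
    finally:
        if added:
            uses_netloc.remove(scheme)


if __name__ == "__main__":
    # Without registration the empty authority is dropped.
    print(round_trip("hdfs:///apps/x"))
    # `file` is already in uses_netloc, so it keeps the triple slash.
    print(round_trip("file:///apps/x"))
    # With registration both spellings end up as hdfs:///apps/x.
    print(round_trip_with_netloc_scheme("hdfs:///apps/x"))
    print(round_trip_with_netloc_scheme("hdfs:/apps/x"))
    # A URI with an explicit authority is untouched either way.
    print(round_trip("hdfs://namenode:8020/path"))
```

The last two registered cases show the trade-off raised in the question: once `hdfs` is in `uses_netloc`, `hdfs:/apps/x` and `hdfs:///apps/x` collapse to the same key.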
