stegololz commented on code in PR #68022:
URL: https://github.com/apache/airflow/pull/68022#discussion_r3408056503
##########
providers/apache/hdfs/src/airflow/providers/apache/hdfs/assets/hdfs.py:
##########
@@ -26,10 +27,12 @@
from airflow.providers.common.compat.openlineage.facet import Dataset as
OpenLineageDataset
+# Preserve the empty-authority "hdfs:///path" (fs.defaultFS) form through
urlunsplit, like "file".
+if "hdfs" not in urllib.parse.uses_netloc:
+ urllib.parse.uses_netloc.append("hdfs")
Review Comment:
I can drop the line or implement a similar behaviour somewhere else, it is
not mandatory.
The SDK normalizes asset URIs through urlunsplit here:
https://github.com/apache/airflow/blob/e5bf1e30c1bf7b5ccf14f2f4cd611458f5bb6891/task-sdk/src/airflow/sdk/definitions/asset/__init__.py#L183-L185
That makes `hdfs:///apps/x` normalize to `hdfs:/apps/x`. I'd like to keep
the canonical `hdfs:///` form, which is why the if is there.
Caveat: as long as I stay on the provider, I won't be able to make a
distinction between `hdfs:/path` and `hdfs:///path` (both end up as
`hdfs:///path`), because the normalizer only sees the parsed result where both
already have an empty netloc. To keep them distinct I'd have to also modify the
Task SDK.
What would be the preference here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]