stegololz commented on code in PR #68022:
URL: https://github.com/apache/airflow/pull/68022#discussion_r3408056503
##########
providers/apache/hdfs/src/airflow/providers/apache/hdfs/assets/hdfs.py:
##########
@@ -26,10 +27,12 @@
from airflow.providers.common.compat.openlineage.facet import Dataset as
OpenLineageDataset
+# Preserve the empty-authority "hdfs:///path" (fs.defaultFS) form through
urlunsplit, like "file".
+if "hdfs" not in urllib.parse.uses_netloc:
+ urllib.parse.uses_netloc.append("hdfs")
Review Comment:
I can drop the line or implement a similar behaviour somewhere else, it is
not mandatory.
The SDK normalizes asset URIs through urlunsplit here:
https://github.com/apache/airflow/blob/e5bf1e30c1bf7b5ccf14f2f4cd611458f5bb6891/task-sdk/src/airflow/sdk/definitions/asset/__init__.py#L183-L185
That makes hdfs:///apps/x normalize to hdfs:/apps/x. I'd like to keep the
canonical hdfs:/// form, which is why the if is there.
Caveat: as long as I stay on the provider, I won't be able to make a
distinction between hdfs:/path and hdfs:///path (both end up as hdfs:///path),
because the normalizer only sees the parsed result where both already have an
empty netloc. To keep them distinct I'd have to also modify the Task SDK.
What would be the preference here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]