stegololz commented on PR #68022:
URL: https://github.com/apache/airflow/pull/68022#issuecomment-4624163911
I fixed the tests, but I'm not happy with the current implementation.
One small caveat you should know about: when the input is
`hdfs:///path`, the asset is now stored as `hdfs:/path`. This is a consequence
of how `urllib.parse` represents and serializes URIs, not a deliberate design
choice on my part:
```python
>>> from urllib.parse import urlsplit, urlunsplit
>>> urlsplit("hdfs:///apps/x") == urlsplit("hdfs:/apps/x")
True
>>> urlunsplit(urlsplit("hdfs:///apps/x"))
'hdfs:/apps/x'
```
`urlsplit` cannot distinguish the two forms: they parse to identical
`SplitResult` instances with an empty `netloc`. For a scheme like `hdfs`,
`urlunsplit` then does not emit `//` for the empty authority; it omits the
`//` entirely. Since the provider hook only sees and returns a `SplitResult`,
and the surrounding `_sanitize_uri` in task-sdk calls `urlunsplit` on the
result, there is no way for the provider on its own to preserve the `///`
form in the stored URI.
Round-trip identity is still stable: both `hdfs:///apps/x` and `hdfs:/apps/x`
normalize to `hdfs:/apps/x`, so asset matching remains consistent across
re-parses. For the OpenLineage conversion, an empty `netloc` produces an
`hdfs://` namespace, which is what consumers expect for
`fs.defaultFS`-resolved paths.
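A quick way to check the round-trip claim, as a minimal sketch: `normalize` here is a hypothetical stand-in for the parse-then-serialize step and is not the real `_sanitize_uri`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(uri: str) -> str:
    """Hypothetical stand-in: parse with urlsplit, serialize with urlunsplit."""
    return urlunsplit(urlsplit(uri))


triple = normalize("hdfs:///apps/x")
single = normalize("hdfs:/apps/x")

# Both spellings collapse to the same stored form ...
assert triple == single
# ... and normalizing the stored form again changes nothing.
assert normalize(triple) == triple
```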
If preserving the literal `hdfs:///path` form in storage is considered worth
the additional surface area, it could be done with a small opt-in in
`_sanitize_uri` (for example, a `normalizer.preserve_empty_authority = True`
attribute set when we want the empty-authority form retained, and a
corresponding string-level fix-up after `urlunsplit`). That keeps the behavior
change scoped to opt-in normalizers (only HDFS would set it today), and leaves
the others untouched.
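A rough sketch of what that opt-in could look like. Everything here is hypothetical: `HdfsNormalizer` and `sanitize_uri_sketch` are illustrative names, the real `_sanitize_uri` does more than this, and only the `preserve_empty_authority` attribute name comes from the suggestion above.

```python
from urllib.parse import SplitResult, urlsplit, urlunsplit


class HdfsNormalizer:
    """Hypothetical normalizer that opts in to keeping the empty authority."""

    preserve_empty_authority = True

    def __call__(self, parts: SplitResult) -> SplitResult:
        return parts  # real provider normalization elided


def sanitize_uri_sketch(uri: str, normalizer) -> str:
    """Hypothetical, simplified stand-in for _sanitize_uri in task-sdk."""
    parts = normalizer(urlsplit(uri))
    out = urlunsplit(parts)
    prefix = f"{parts.scheme}:/"
    if (
        getattr(normalizer, "preserve_empty_authority", False)
        and not parts.netloc
        and out.startswith(prefix)
        and not out.startswith(prefix + "/")
    ):
        # String-level fix-up: put back the "//" that urlunsplit dropped.
        out = f"{parts.scheme}://{out[len(parts.scheme) + 1:]}"
    return out
```

With this sketch, `sanitize_uri_sketch("hdfs:///apps/x", HdfsNormalizer())` keeps the `hdfs:///apps/x` form, while a normalizer without the attribute gets today's `hdfs:/apps/x` behavior. One design consequence worth noting: `hdfs:/apps/x` would also be stored as `hdfs:///apps/x`, since the two inputs are indistinguishable after parsing.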
I deliberately left this out of the current PR to keep the change minimal
and provider-local. Happy to follow up with that task-sdk change if someone
thinks it is worth it.