stegololz commented on PR #68022:
URL: https://github.com/apache/airflow/pull/68022#issuecomment-4624163911

   I fixed the tests, but I'm not happy with the current implementation.
   
   One caveat worth knowing about: when the input is `hdfs:///path`, the asset 
is now stored as `hdfs:/path`. This is a consequence of how `urllib.parse` 
represents and serializes URIs, not a deliberate design choice on my part:
   
   ```python
   >>> from urllib.parse import urlsplit, urlunsplit
   >>> urlsplit("hdfs:///apps/x") == urlsplit("hdfs:/apps/x")
   True
   >>> urlunsplit(urlsplit("hdfs:///apps/x"))
   'hdfs:/apps/x'
   ```
   `urlsplit` cannot distinguish the two forms: they parse to identical 
`SplitResult` instances with an empty `netloc`. `urlunsplit`, in turn, does not 
emit `//` for the empty authority here; it omits the `//` entirely. Since the 
provider hook only sees and returns a `SplitResult`, and the surrounding 
`_sanitize_uri` in task-sdk calls `urlunsplit` on the result, there is no way 
for the provider on its own to preserve the `///` form in the stored URI.
   
   Round-trip identity is still stable: both `hdfs:///apps/x` and 
`hdfs:/apps/x` normalize to `hdfs:/apps/x`, so asset matching remains 
consistent across re-parses. For the OpenLineage conversion, an empty `netloc` 
produces an `hdfs://` namespace, which is what consumers expect for 
`fs.defaultFS`-resolved paths.
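   
   To make the round-trip claim concrete, here is a minimal sketch. The 
`normalize` helper is hypothetical and only stands in for the parse-then-
`urlunsplit` step; it is not the actual task-sdk code:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(uri: str) -> str:
    """Hypothetical stand-in: parse the URI and serialize it again."""
    return urlunsplit(urlsplit(uri))


stored_a = normalize("hdfs:///apps/x")
stored_b = normalize("hdfs:/apps/x")

# Both spellings collapse to the same stored string ...
assert stored_a == stored_b
# ... and normalizing the stored string again changes nothing.
assert normalize(stored_a) == stored_a
```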
   
   If preserving the literal `hdfs:///path` form in storage is considered worth 
the additional surface area, it could be done with a small opt-in in 
`_sanitize_uri`: for example, a `normalizer.preserve_empty_authority = True` 
attribute set when we want the empty-authority form retained, plus a 
corresponding string-level fix-up after `urlunsplit`. That keeps the behavior 
change scoped to opt-in normalizers (only HDFS would set it today) and leaves 
all others untouched.
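   
   A rough sketch of what that string-level fix-up could look like. 
`serialize` and its `preserve_empty_authority` parameter are hypothetical names 
for illustration, not existing task-sdk API; the real change would live inside 
`_sanitize_uri`:

```python
from urllib.parse import SplitResult, urlsplit, urlunsplit


def serialize(parts: SplitResult, preserve_empty_authority: bool = False) -> str:
    """Hypothetical helper: urlunsplit, optionally restoring an empty authority."""
    uri = urlunsplit(parts)
    # Default behavior is untouched, and a real authority needs no fix-up.
    if not preserve_empty_authority or parts.netloc or not parts.scheme:
        return uri
    prefix = f"{parts.scheme}:"
    # Patch only "scheme:/path"; leave "scheme://..." and rootless paths alone.
    if uri.startswith(prefix + "/") and not uri.startswith(prefix + "//"):
        return prefix + "//" + uri[len(prefix):]
    return uri


parts = urlsplit("hdfs:///apps/x")
restored = serialize(parts, preserve_empty_authority=True)
assert restored == "hdfs:///apps/x"
# The restored form still parses to the same SplitResult.
assert urlsplit(restored) == parts
```

   Note that this fix-up would also turn an input of `hdfs:/apps/x` into 
`hdfs:///apps/x`, since both spellings parse to the same `SplitResult`.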
   
   I deliberately left this out of the current PR to keep the change minimal 
and provider-local. Happy to follow up with that task-sdk change if someone 
thinks it is worthwhile.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
