stegololz opened a new pull request, #68022:
URL: https://github.com/apache/airflow/pull/68022
## Summary
Relax `airflow.providers.apache.hdfs.assets.hdfs.sanitize_uri` to accept the
canonical `hdfs:///path` form (empty netloc). Previously this form was rejected
with `ValueError: URI format hdfs:// must contain a namenode host`.
## Why
- **RFC 3986**: the authority component of a URI is optional. `hdfs:///path`
is well-formed.
- **Hadoop semantics**: an empty authority means "resolve via `fs.defaultFS`
from `core-site.xml`". This is the standard idiom for portable
Spark/Hive/MapReduce jobs that must not hard-code a namenode, and it has the
same shape as `file:///etc/hosts`.
- The strict check was introduced in #66426 (alongside other new-scheme
sanitizers). It is more restrictive than the Hadoop convention and breaks any
DAG using `Asset("hdfs:///apps/x/file.parquet")` at parse time.
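To illustrate the points above, here is a minimal sketch using only the standard library's `urllib.parse.urlsplit`. The helper name `split_authority` is hypothetical and is not part of the provider; it only shows that `hdfs:///path` parses cleanly with an empty netloc, the same way `file:///etc/hosts` does.

```python
from urllib.parse import urlsplit


def split_authority(uri: str) -> tuple[str, str]:
    """Hypothetical helper: return (netloc, path) for a URI string."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


# Empty authority: Hadoop would resolve the namenode via fs.defaultFS.
print(split_authority("hdfs:///apps/x/file.parquet"))
# Explicit authority: the namenode is hard-coded in the URI.
print(split_authority("hdfs://namenode:8020/apps/x/file.parquet"))
# Same shape as the empty-authority HDFS form.
print(split_authority("file:///etc/hosts"))
```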
## Change
- `providers/apache/hdfs/.../assets/hdfs.py`: drop the "must contain a
namenode host" check; keep the path-required check.
- `providers/apache/hdfs/.../tests/.../test_hdfs.py`:
  - Add positive cases for `hdfs:///apps/myapp/...` (empty netloc), which must pass.
  - Add negative case `hdfs://namenode:8020` (no path), which must fail.
  - Add `test_convert_asset_to_openlineage_default_fs` covering OpenLineage
    emission with empty netloc.
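The relaxed behaviour described in this list can be sketched roughly as follows. This is an illustration under assumptions, not the provider's code: the function name `sanitize_hdfs_uri_sketch` and the error text for the missing-path case are hypothetical, and the real `sanitize_uri` may differ in details.

```python
from urllib.parse import SplitResult, urlsplit


def sanitize_hdfs_uri_sketch(uri: SplitResult) -> SplitResult:
    """Hypothetical stand-in for the relaxed check."""
    # No netloc check: an empty authority is accepted and left for Hadoop
    # to resolve through fs.defaultFS.
    if not uri.path:
        # Illustrative message; the provider's actual wording is not shown here.
        raise ValueError("URI format hdfs:// must contain a path")
    return uri


# Positive case: empty netloc, non-empty path.
ok = sanitize_hdfs_uri_sketch(urlsplit("hdfs:///apps/x/file.parquet"))
print(repr(ok.netloc), ok.path)

# Negative case: namenode given, but no path.
try:
    sanitize_hdfs_uri_sketch(urlsplit("hdfs://namenode:8020"))
except ValueError as exc:
    print("rejected:", exc)
```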
`convert_asset_to_openlineage` already tolerates an empty netloc
(`f"hdfs://{parsed.netloc}"` yields `hdfs://` namespace), so no functional
change there.
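As a small check of the namespace claim, the sketch below reproduces only the `f"hdfs://{parsed.netloc}"` expression quoted above. The wrapper `openlineage_namespace_sketch` is a hypothetical name, and the rest of `convert_asset_to_openlineage` (for example how the dataset name is derived) is intentionally not modelled.

```python
from urllib.parse import urlsplit


def openlineage_namespace_sketch(uri: str) -> str:
    """Hypothetical wrapper around the namespace expression from the PR text."""
    parsed = urlsplit(uri)
    # With an empty netloc this collapses to the bare "hdfs://" namespace.
    return f"hdfs://{parsed.netloc}"


print(openlineage_namespace_sketch("hdfs:///apps/x/file.parquet"))
print(openlineage_namespace_sketch("hdfs://namenode:8020/apps/x/file.parquet"))
```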
## Related
- Introduced the strict check: #66426
- Sister fix for the mssql sanitizer: #67999
## Gen-AI disclosure
This PR was prepared with Gen-AI assistance (Claude). I reviewed all
generated code.