parthchandra opened a new pull request, #4413:
URL: https://github.com/apache/datafusion-comet/pull/4413
## Which issue does this PR close?
Closes #4150.
## Rationale for this change
PR #4350 wired `parse_url` through Comet's serde layer to the upstream
`datafusion-spark` UDFs but marked it `Incompatible` due to divergences between
the WHATWG URL Standard (`url` crate) and Spark's `java.net.URI` (RFC 3986).
This PR replaces the upstream implementation with a local RFC 3986 regex-based
parser that matches Spark's behavior, promoting `parse_url` from
`Incompatible` to `Compatible`.
## What changes are included in this PR?
**New: `native/spark-expr/src/url_funcs/parse_url.rs`**
- `CometParseUrl` and `CometTryParseUrl` UDFs using [RFC 3986 Appendix
B](https://www.rfc-editor.org/rfc/rfc3986#appendix-B) regex decomposition
instead of the WHATWG `url` crate
- Fixes all tracked divergences from Spark:
1. Empty-string URL with PATH/FILE now returns `""` (was NULL)
2. FILE on URL without explicit path (`http://host?foo=bar`) returns
`?foo=bar` (was `/?foo=bar`)
3. PATH on trailing slash (`http://host/`) returns `"/"` (was `""`)
4. 3-arg QUERY with key returns raw percent-encoded value (was decoded)
- Additional correctness fixes found during adversarial review:
- Query values containing `=` (e.g., `?a=b=c`) now correctly return `b=c`
- 3-arg call with non-QUERY part (e.g., `parse_url(url, 'HOST', 'key')`)
returns NULL
- Empty port (`http://host:/path`) strips trailing colon
- Empty authority (`http:///path`) returns NULL instead of `""`
- ANSI mode: raises `[INVALID_URL]` error for malformed URLs (spaces,
control chars, missing scheme with `://`)
**Modified: `native/core/src/execution/jni_api.rs`**
- Registers `CometParseUrl`/`CometTryParseUrl` from `spark-expr` instead
of upstream `SparkParseUrl`/`SparkTryParseUrl`
**Modified: `spark/.../serde/url.scala`**
- Removes `Incompatible` override and `incompatibleReason` (now
`Compatible` by default)
**Modified: SQL test files**
- `parse_url.sql`: expanded from 4 fallback queries to 30+
native-execution queries covering all components, edge cases, and divergence
fixes
- `parse_url_ansi.sql`: enables previously-ignored ANSI error tests with
`expect_error(INVALID_URL)`
## How are these changes tested?
- 25 Rust unit tests in `parse_url.rs` covering all URL components,
divergence fixes, error handling, null propagation, and edge cases (query
values with `=`, 3-arg non-QUERY, empty port, empty authority, IPv6, malformed
URLs)
- SQL file tests (`parse_url.sql`, `parse_url_ansi.sql`) verified against
Spark on both `spark-4.0` and `spark-4.1` profiles
- Adversarial code review identified and fixed 4 additional correctness
issues before merge
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]