parthchandra opened a new pull request, #4413:
URL: https://github.com/apache/datafusion-comet/pull/4413

   ## Which issue does this PR close?
     
     Closes #4150.
   
     ## Rationale for this change
   
     PR #4350 wired `parse_url` through Comet's serde layer to the upstream 
`datafusion-spark` UDFs but marked it `Incompatible` due to divergences between 
the WHATWG URL Standard (`url` crate) and Spark's `java.net.URI` (RFC 3986). 
This PR replaces the upstream implementation with a local RFC 3986 regex-based
     parser that matches Spark's behavior, promoting `parse_url` from 
`Incompatible` to `Compatible`.
   
     ## What changes are included in this PR?
   
     **New: `native/spark-expr/src/url_funcs/parse_url.rs`**
     - `CometParseUrl` and `CometTryParseUrl` UDFs using [RFC 3986 Appendix 
B](https://www.rfc-editor.org/rfc/rfc3986#appendix-B) regex decomposition 
instead of the WHATWG `url` crate
     - Fixes all tracked divergences from Spark:
       1. Empty-string URL with PATH/FILE now returns `""` (was NULL)
       2. FILE on URL without explicit path (`http://host?foo=bar`) returns 
`?foo=bar` (was `/?foo=bar`)
       3. PATH on trailing slash (`http://host/`) returns `"/"` (was `""`)
       4. 3-arg QUERY with key returns raw percent-encoded value (was decoded)
     - Additional correctness fixes found during adversarial review:
       - Query values containing `=` (e.g., `?a=b=c`) now correctly return `b=c`
       - 3-arg call with non-QUERY part (e.g., `parse_url(url, 'HOST', 'key')`) 
returns NULL
       - Empty port (`http://host:/path`) strips trailing colon
       - Empty authority (`http:///path`) returns NULL instead of `""`
     - ANSI mode: raises `[INVALID_URL]` error for malformed URLs (spaces, 
control chars, missing scheme with `://`)
   
     **Modified: `native/core/src/execution/jni_api.rs`**
     - Registers `CometParseUrl`/`CometTryParseUrl` from `spark-expr` instead 
of upstream `SparkParseUrl`/`SparkTryParseUrl`
   
     **Modified: `spark/.../serde/url.scala`**
     - Removes `Incompatible` override and `incompatibleReason` (now 
`Compatible` by default)
   
     **Modified: SQL test files**
     - `parse_url.sql`: expanded from 4 fallback queries to 30+ 
native-execution queries covering all components, edge cases, and divergence 
fixes
     - `parse_url_ansi.sql`: enables previously-ignored ANSI error tests with 
`expect_error(INVALID_URL)`
   
     ## How are these changes tested?
   
     - 25 Rust unit tests in `parse_url.rs` covering all URL components, 
divergence fixes, error handling, null propagation, and edge cases (query 
values with `=`, 3-arg non-QUERY, empty port, empty authority, IPv6, malformed 
URLs)
     - SQL file tests (`parse_url.sql`, `parse_url_ansi.sql`) verified against 
Spark on both `spark-4.0` and `spark-4.1` profiles
     - Adversarial code review identified and fixed 4 additional correctness 
issues before merge
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to