andygrove commented on PR #4413:
URL: 
https://github.com/apache/datafusion-comet/pull/4413#issuecomment-4549834938

   Thanks. This is a good improvement, but I think there are still a few 
remaining gaps before we can claim full compatibility.
   
   Ran the PR locally against Spark on `parth/parse_url_2` and found four 
behavioral cases where Comet still diverges from Spark's `java.net.URI`. Each 
is small but reproducible. With the serde now defaulting to `Compatible`, it 
would be good to capture these in `sql-tests/expressions/url/` either as fixes, 
or as `query ignore(<tracking-issue>)` placeholders so future readers can find 
them.
   
   | Case | Query | Spark | Comet |
   |---|---|---|---|
   | Regex metachar in 3-arg query key | `parse_url('http://h/p?abc=1', 
'QUERY', '.bc')` | `1` | `null` |
   | Non-digit port | `parse_url('http://host:abc/', 'HOST')` | `null` | 
`host:abc` |
   | Backslash in URL | `parse_url('http://host/p\q', 'PATH')` | `null` | 
`/p\q` |
   | Unbalanced IPv6 bracket | `parse_url('http://[::1/path', 'AUTHORITY')` | 
`null` | `[::1` |
   
   Quick notes on each:
   
   1. Spark builds the key matcher as `Pattern.compile("(&|^)" + key + 
"=([^&]*)")` with no escaping. `.` in the key matches any character. 
`extract_query_value` in `parse_url.rs` does literal `split('&')` matching.
   2. `java.net.URI` rejects non-digit ports. `extract_host` in `parse_url.rs` 
only strips the trailing `:NNNN` when the port is all digits, otherwise returns 
the unstripped string.
   3. `has_invalid_uri_chars` in `parse_url.rs` only flags space, `{ } < >`, 
and bytes `< 0x20`. `java.net.URI` rejects a wider set (backslash, double 
quote, caret, etc.).
   4. `extract_host` returns `None` for an unbalanced bracket, but the 
surrounding regex still captures an authority and path, so `AUTHORITY`, `PATH`, 
and `FILE` come back non-null while Spark's URI throws and returns null 
everywhere.
   
   Happy to share the four SQL fixture files if useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to