oc7o opened a new pull request, #23188:
URL: https://github.com/apache/datafusion/pull/23188

   `SIMILAR TO` previously passed the pattern straight to Arrow's regex engine, 
so SQL wildcards were never translated and matches were unanchored:
   
       SELECT 'abc' SIMILAR TO 'a%';  -- returned false
       SELECT 'x'   SIMILAR TO '_';   -- returned false
   
   Translate `%` to `.*` and `_` to `.`, then wrap the pattern in `^(?:...)$` 
so the regex matches the entire string. Other regex metacharacters (`|`, `(`, 
`)`, `*`, `+`, `?`) pass through unchanged, matching `SIMILAR TO`'s 
superset-of-regex semantics.
   
   The translation only fires for literal `Utf8`, `LargeUtf8`, and `Utf8View` 
patterns. Non-literal patterns return a `not_impl_err!` — silently wrong 
results are worse than an honest error, and this mirrors how DataFusion already 
handles the unsupported `ESCAPE` clause. NULL patterns pass through unchanged.
   
   ## Which issue does this PR close?
   
   - Closes #22263.
   
   ## Rationale for this change
   
   `SIMILAR TO` is a SQL standard operator with well-defined wildcard semantics 
(`%` = any sequence, `_` = single character, full-string match). DataFusion's 
current behavior silently produces wrong results for the most basic patterns, 
which is a correctness bug for anyone porting queries from Postgres or other 
SQL engines.
   
   ## What changes are included in this PR?
   
   - New `sql_similar_to_regex` helper in 
`datafusion/physical-expr/src/expressions/binary.rs` that translates `%`/`_` 
and anchors the pattern with `^(?:...)$`.
   - `similar_to()` now translates the pattern for literal `Utf8` / `LargeUtf8` 
/ `Utf8View` values, passes `NULL` through unchanged, and returns 
`not_impl_err!` for non-literal patterns.
   
   ## Are these changes tested?
   
   Yes:
   - Existing `test_similar_to` in `binary.rs` was relying on the bug by 
passing raw regex strings; rewritten to use SQL wildcard syntax.
   - New unit tests cover `%`, `_`, full-string anchoring, regex-metacharacter 
passthrough, NULL pattern, and the non-literal-pattern error path.
   - End-to-end coverage added in 
`datafusion/sqllogictest/test_files/strings.slt`.
   
   ## Are there any user-facing changes?
   
   Yes — `SIMILAR TO` now produces correct results for queries that were 
previously returning wrong answers. Queries that happened to rely on the buggy 
behavior (passing raw regex through `SIMILAR TO`) will change. No public API 
changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to