Dandandan opened a new pull request, #22273:
URL: https://github.com/apache/datafusion/pull/22273

   ## Which issue does this PR close?
   
   - Closes #22263.
   
   ## Rationale for this change
   
   `SIMILAR TO` is supposed to mix SQL LIKE wildcards (`%`, `_`) with POSIX 
regex metacharacters and match the entire string. DataFusion was lowering 
`SIMILAR TO` directly to a `RegexMatch` over the pattern verbatim, so:
   
   - `'abc' SIMILAR TO 'a%'` returned `false` (PG: `true`) because `%` was 
passed through as a literal regex character.
   - `'abc' SIMILAR TO 'b'` returned `true` (PG: `false`) because the regex 
match isn't anchored.
   - `.`, `^`, `$` were treated as regex metacharacters instead of literals.
   
   ## What changes are included in this PR?
   
   - New `translate_similar_to_pattern()` helper in 
`datafusion/physical-expr/src/expressions/binary.rs` that converts a `SIMILAR 
TO` pattern into an equivalent POSIX regex:
     - Anchors the result with `^...$`.
     - `%` → `.*`, `_` → `.`.
     - Escapes `.`, `^`, `$` (literal in SIMILAR TO, meta in regex).
     - Passes POSIX metas (`|`, `*`, `+`, `?`, `()`, `{m,n}`, `[...]`) through 
unchanged.
     - Handles `\` as an escape for the next character.
     - Passes bracket expressions through verbatim, including a leading literal 
`]`/`^]`.
   - Planner (`datafusion/physical-expr/src/planner.rs`) now translates literal 
`Utf8` / `LargeUtf8` / `Utf8View` patterns at planning time. NULL patterns flow 
through as a typed `Utf8` null. Non-literal patterns return a clear 
`not_impl_err` rather than the previous silently-wrong behavior.
   - SQL layer (`datafusion/sql/src/expr/mod.rs`) pattern-type check widened to 
accept `LargeUtf8` and `Utf8View` literals (previously rejected even though the 
underlying regex match supports them).
   
   ## Are these changes tested?
   
   Yes:
   
   - Unit test `similar_to_pattern_translation` in `binary.rs` covers 
wildcards, anchoring, regex metas, literal `.`/`^`/`$`, backslash escapes, 
bracket expressions (including `[]abc]` and `[^]abc]`).
   - `datafusion/sqllogictest/test_files/strings.slt` has a new regression 
block exercising the bug-report case, `_` wildcard, anchoring, literal 
`.`/`^`/`$`, regex metas (`|`, `{m,n}`, `+`), backslash-escaped wildcards, 
`NULL` pattern, and the non-literal-pattern error.
   - The existing `SIMILAR TO 'p[12].*'` test in `strings.slt` was relying on 
the buggy regex-passthrough behavior (`p1e1` etc. matched only because `.` was 
treated as regex `.`); it's been changed to `'p[12]%'` which expresses the same 
intent under correct SIMILAR TO semantics.
   
   ## Are there any user-facing changes?
   
   Yes — `SIMILAR TO` is now PostgreSQL-compatible:
   
   - Behavior changes for patterns containing `%`, `_`, or unanchored matches. 
Existing patterns that relied on the previous regex-passthrough behavior may 
need to be updated (most obviously, change `.*` to `%`).
   - Non-literal patterns now return `Not yet implemented: SIMILAR TO with a 
non-literal pattern is not yet supported` instead of silently returning a regex 
match. This was almost certainly broken in practice already, but it is a 
visible error message change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to