kosiew commented on code in PR #22273:
URL: https://github.com/apache/datafusion/pull/22273#discussion_r3292792165
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -995,6 +995,74 @@ pub fn similar_to(
Ok(Arc::new(BinaryExpr::new(expr, binary_op, pattern)))
}
+/// Translate a SQL `SIMILAR TO` pattern into an equivalent POSIX regex.
+///
+/// PostgreSQL `SIMILAR TO` mixes SQL LIKE wildcards with POSIX-style regex
+/// metacharacters and requires the pattern to match the entire string.
+/// In particular:
+///
+/// * `%` matches any sequence of zero or more characters (like LIKE).
+/// * `_` matches exactly one character (like LIKE).
+/// * `|`, `*`, `+`, `?`, `()`, `{m[,n]}`, `[...]` keep their POSIX regex
meaning.
+/// * `.`, `^`, `$` are *literal* characters (not regex metacharacters).
+/// * `\` is the default escape character, so `\X` means a literal `X`.
+///
+/// The translated regex is wrapped with `^...$` so the regex engine enforces
+/// a full-string match.
+pub fn translate_similar_to_pattern(pattern: &str) -> String {
+ let mut out = String::with_capacity(pattern.len() + 2);
+ out.push('^');
+ let mut chars = pattern.chars().peekable();
+ while let Some(c) = chars.next() {
+ match c {
+ '%' => out.push_str(".*"),
+ '_' => out.push('.'),
+ '\\' => {
+ // Backslash escapes the next character in SIMILAR TO. Emit it
+ // as a literal in the regex by re-escaping it.
+ match chars.next() {
+ Some(next) => {
Review Comment:
I think this is still treating escaped characters as regex escapes rather
than regex literals.
The new invariant says that `\X` in `SIMILAR TO` should mean a literal `X`,
but this code turns the pattern `\d` into the regex `^\d$`. That matches a
digit and does not match `d`.
I can reproduce this on the branch with:
```sql
SELECT '5' SIMILAR TO '\d', 'd' SIMILAR TO '\d';
```
This returns `true, false`, but I would expect `false, true`.
Could we escape the next character as a regex literal instead, for example
by applying regex escaping to that single character, rather than always
prefixing it with `\`?
##########
datafusion/physical-expr/src/planner.rs:
##########
@@ -253,8 +255,41 @@ pub fn create_physical_expr(
}
let physical_expr =
create_physical_expr(expr, input_dfschema, execution_props)?;
+ // SIMILAR TO uses SQL wildcards (`%`, `_`) layered on POSIX regex
and
+ // requires a whole-string match. Translate literal patterns to an
+ // equivalent regex so the existing regex-match operator returns
+ // PostgreSQL-compatible results.
+ let translated_pattern = match pattern.as_ref() {
+ Expr::Literal(ScalarValue::Utf8(Some(s)), m) => Expr::Literal(
+ ScalarValue::Utf8(Some(translate_similar_to_pattern(s))),
+ m.clone(),
+ ),
+ Expr::Literal(ScalarValue::LargeUtf8(Some(s)), m) =>
Expr::Literal(
+
ScalarValue::LargeUtf8(Some(translate_similar_to_pattern(s))),
+ m.clone(),
+ ),
+ Expr::Literal(ScalarValue::Utf8View(Some(s)), m) =>
Expr::Literal(
+
ScalarValue::Utf8View(Some(translate_similar_to_pattern(s))),
+ m.clone(),
+ ),
+ // NULL pattern: regex match against NULL returns NULL. Use a
+ // typed Utf8 null so the regex kernel can handle it.
+ Expr::Literal(
+ ScalarValue::Utf8(None)
+ | ScalarValue::LargeUtf8(None)
+ | ScalarValue::Utf8View(None)
+ | ScalarValue::Null,
+ m,
+ ) => Expr::Literal(ScalarValue::Utf8(None), m.clone()),
+ _ => {
Review Comment:
This looks like it only fixes literal patterns, while non-literal `SIMILAR
TO` patterns are now rejected during physical planning.
That seems like a SQL-visible regression, since `SIMILAR TO` should accept
pattern expressions. For example, this should still be evaluable rather than
failing during planning:
```sql
SELECT s FROM test WHERE s SIMILAR TO s;
```
Could we move the translation to a layer that can handle runtime pattern
values, such as a physical expression or scalar kernel that translates the
pattern column before regex matching? Alternatively, please preserve the
currently supported non-literal cases while still enforcing SQL `SIMILAR TO`
semantics end to end.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]