kosiew commented on code in PR #22273:
URL: https://github.com/apache/datafusion/pull/22273#discussion_r3292792165


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -995,6 +995,74 @@ pub fn similar_to(
     Ok(Arc::new(BinaryExpr::new(expr, binary_op, pattern)))
 }
 
+/// Translate a SQL `SIMILAR TO` pattern into an equivalent POSIX regex.
+///
+/// PostgreSQL `SIMILAR TO` mixes SQL LIKE wildcards with POSIX-style regex
+/// metacharacters and requires the pattern to match the entire string.
+/// In particular:
+///
+/// * `%` matches any sequence of zero or more characters (like LIKE).
+/// * `_` matches exactly one character (like LIKE).
+/// * `|`, `*`, `+`, `?`, `()`, `{m[,n]}`, `[...]` keep their POSIX regex 
meaning.
+/// * `.`, `^`, `$` are *literal* characters (not regex metacharacters).
+/// * `\` is the default escape character, so `\X` means a literal `X`.
+///
+/// The translated regex is wrapped with `^...$` so the regex engine enforces
+/// a full-string match.
+pub fn translate_similar_to_pattern(pattern: &str) -> String {
+    let mut out = String::with_capacity(pattern.len() + 2);
+    out.push('^');
+    let mut chars = pattern.chars().peekable();
+    while let Some(c) = chars.next() {
+        match c {
+            '%' => out.push_str(".*"),
+            '_' => out.push('.'),
+            '\\' => {
+                // Backslash escapes the next character in SIMILAR TO. Emit it
+                // as a literal in the regex by re-escaping it.
+                match chars.next() {
+                    Some(next) => {

Review Comment:
   I think this is still treating escaped characters as regex escapes rather 
than regex literals.
   
   The new invariant says that `\X` in `SIMILAR TO` should mean a literal `X`, 
but this code turns the pattern `\d` into the regex `^\d$`. That matches a 
digit and does not match `d`.
   
   I can reproduce this on the branch with:
   
   ```sql
   SELECT '5' SIMILAR TO '\d', 'd' SIMILAR TO '\d';
   ```
   
   This returns `true, false`, but I would expect `false, true`.
   
   Could we escape the next character as a regex literal instead, for example 
by applying regex escaping to that single character, rather than always 
prefixing it with `\`?



##########
datafusion/physical-expr/src/planner.rs:
##########
@@ -253,8 +255,41 @@ pub fn create_physical_expr(
             }
             let physical_expr =
                 create_physical_expr(expr, input_dfschema, execution_props)?;
+            // SIMILAR TO uses SQL wildcards (`%`, `_`) layered on POSIX regex 
and
+            // requires a whole-string match. Translate literal patterns to an
+            // equivalent regex so the existing regex-match operator returns
+            // PostgreSQL-compatible results.
+            let translated_pattern = match pattern.as_ref() {
+                Expr::Literal(ScalarValue::Utf8(Some(s)), m) => Expr::Literal(
+                    ScalarValue::Utf8(Some(translate_similar_to_pattern(s))),
+                    m.clone(),
+                ),
+                Expr::Literal(ScalarValue::LargeUtf8(Some(s)), m) => 
Expr::Literal(
+                    
ScalarValue::LargeUtf8(Some(translate_similar_to_pattern(s))),
+                    m.clone(),
+                ),
+                Expr::Literal(ScalarValue::Utf8View(Some(s)), m) => 
Expr::Literal(
+                    
ScalarValue::Utf8View(Some(translate_similar_to_pattern(s))),
+                    m.clone(),
+                ),
+                // NULL pattern: regex match against NULL returns NULL. Use a
+                // typed Utf8 null so the regex kernel can handle it.
+                Expr::Literal(
+                    ScalarValue::Utf8(None)
+                    | ScalarValue::LargeUtf8(None)
+                    | ScalarValue::Utf8View(None)
+                    | ScalarValue::Null,
+                    m,
+                ) => Expr::Literal(ScalarValue::Utf8(None), m.clone()),
+                _ => {

Review Comment:
   This looks like it only fixes literal patterns, while non-literal `SIMILAR 
TO` patterns are now rejected during physical planning.
   
   That seems like a SQL-visible regression, since `SIMILAR TO` should accept 
pattern expressions. For example, this should still be evaluable rather than 
failing during planning:
   
   ```sql
   SELECT s FROM test WHERE s SIMILAR TO s;
   ```
   
   Could we move the translation to a layer that can handle runtime pattern 
values, such as a physical expression or scalar kernel that translates the 
pattern column before regex matching? Alternatively, please preserve the 
currently supported non-literal cases while still enforcing SQL `SIMILAR TO` 
semantics end to end.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to