kosiew commented on code in PR #23188:
URL: https://github.com/apache/datafusion/pull/23188#discussion_r3490600637


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -1099,6 +1099,20 @@ pub fn binary(
     Ok(Arc::new(BinaryExpr::new(lhs, op, rhs)))
 }
 
+fn sql_similar_to_regex(pattern: &str) -> String {
+    let mut result = String::with_capacity(pattern.len() + 6);
+    result.push_str("^(?:");
+    for ch in pattern.chars() {
+        match ch {
+            '%' => result.push_str(".*"),
+            '_' => result.push('.'),
+            c => result.push(c),

Review Comment:
   Thanks for tackling this. I think there is still one correctness issue here.
   
   `SIMILAR TO` currently copies every non-`%` and non-`_` character directly 
into the Arrow regex. That means regex metacharacters like `.`, `^`, and `$` 
are still treated as regex operators even though they are literals in SQL 
`SIMILAR TO` patterns.
   
   For example, the SQL pattern `a.` should only match the literal string `a.`, 
but the current translation produces `^(?:a.)$`, so `SELECT 'ab' SIMILAR TO 
'a.'` incorrectly returns true.
   
   Could we translate the SQL pattern grammar explicitly instead? SQL literals 
should be escaped for the regex, and only the metacharacters that `SIMILAR TO` 
actually defines should be emitted as regex syntax.



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -1099,6 +1099,20 @@ pub fn binary(
     Ok(Arc::new(BinaryExpr::new(lhs, op, rhs)))
 }
 
+fn sql_similar_to_regex(pattern: &str) -> String {
+    let mut result = String::with_capacity(pattern.len() + 6);
+    result.push_str("^(?:");
+    for ch in pattern.chars() {
+        match ch {
+            '%' => result.push_str(".*"),

Review Comment:
   One other correctness issue I noticed is that `%` and `_` are translated to 
`.*` and `.`, but Rust and Arrow regexes do not let `.` match newlines by 
default.
   
   SQL wildcards are expected to match any character, including newlines, so 
values containing `\n` still behave incorrectly. For example, `'a\nb' SIMILAR 
TO 'a%b'` should match, but the generated `^(?:a.*b)$` does not.
   
   Could we use a dot-all form such as `(?s:.*)` and `(?s:.)`, or an equivalent 
character class? It would also be great to add a regression test for this case.



##########
datafusion/sqllogictest/test_files/strings.slt:
##########
@@ -91,6 +91,27 @@ P1e1
 P1m1e1
 e1
 
+# SIMILAR TO with `%` wildcard (zero or more characters)

Review Comment:
   The new SLT coverage covers the common `%`, `_`, and anchoring cases well. 
After updating the translator, could we also add a small regression test 
showing that regex metacharacters are treated as literals? For example, 
`SIMILAR TO 'a.'` should match `'a.'` but not `'ab'`. That should help prevent 
accidental regressions back to raw regex semantics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to