jerry-024 commented on code in PR #271:
URL: https://github.com/apache/paimon-rust/pull/271#discussion_r3121129156


##########
crates/integrations/datafusion/src/sql_handler.rs:
##########
@@ -361,6 +373,143 @@ impl PaimonSqlHandler {
     }
 }
 
+/// Find `PARTITIONED BY` keyword position, skipping string literals and 
comments.
+/// All offsets are byte-safe because SQL keywords and quote/comment 
delimiters are ASCII.
+fn find_partitioned_by(sql: &str) -> Option<(usize, usize)> {
+    let bytes = sql.as_bytes();
+    let len = bytes.len();
+    let mut i = 0;
+    while i < len {
+        match bytes[i] {
+            b'\'' => {
+                i += 1;
+                while i < len {
+                    if bytes[i] == b'\'' {
+                        i += 1;
+                        if i < len && bytes[i] == b'\'' {
+                            i += 1;
+                        } else {
+                            break;
+                        }
+                    } else {
+                        i += 1;
+                    }
+                }
+            }
+            b'-' if i + 1 < len && bytes[i + 1] == b'-' => {
+                i += 2;
+                while i < len && bytes[i] != b'\n' {
+                    i += 1;
+                }
+            }
+            _ => {
+                if sql[i..].len() >= 11 && sql[i..i + 
11].eq_ignore_ascii_case("PARTITIONED") {

Review Comment:
   This loop advances `i` one byte at a time, but then slices `sql[i..]` and 
`sql[i..i + 11]`. As soon as the input contains a non-ASCII character outside 
`...` or `--` comments, `i` can land in the middle of a UTF-8 codepoint and 
panic before we return a `DataFusionError`. That makes this a blocker for me.



##########
crates/integrations/datafusion/src/sql_handler.rs:
##########
@@ -92,8 +91,9 @@ impl PaimonSqlHandler {
     /// Execute a SQL statement. ALTER TABLE is handled by Paimon directly;
     /// everything else is delegated to DataFusion.
     pub async fn sql(&self, sql: &str) -> DFResult<DataFrame> {
+        let (rewritten_sql, partition_keys) = extract_partition_by(sql)?;

Review Comment:
   Running `extract_partition_by` before parsing every statement changes 
behavior beyond `CREATE TABLE`. The scanner currently skips only single-quoted 
strings and `--` comments, so `PARTITIONED BY` inside `/* ... */` comments or 
quoted identifiers can be misdetected and turn otherwise valid SQL into 
`Expected ( after PARTITIONED BY`. I think this needs to be restricted to 
`CREATE TABLE` parsing or made fully tokenizer/AST-aware before merge.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to