jerry-024 commented on code in PR #271:
URL: https://github.com/apache/paimon-rust/pull/271#discussion_r3121129156
##########
crates/integrations/datafusion/src/sql_handler.rs:
##########
@@ -361,6 +373,143 @@ impl PaimonSqlHandler {
}
}
+/// Find `PARTITIONED BY` keyword position, skipping string literals and
comments.
+/// All offsets are byte-safe because SQL keywords and quote/comment
delimiters are ASCII.
+fn find_partitioned_by(sql: &str) -> Option<(usize, usize)> {
+ let bytes = sql.as_bytes();
+ let len = bytes.len();
+ let mut i = 0;
+ while i < len {
+ match bytes[i] {
+ b'\'' => {
+ i += 1;
+ while i < len {
+ if bytes[i] == b'\'' {
+ i += 1;
+ if i < len && bytes[i] == b'\'' {
+ i += 1;
+ } else {
+ break;
+ }
+ } else {
+ i += 1;
+ }
+ }
+ }
+ b'-' if i + 1 < len && bytes[i + 1] == b'-' => {
+ i += 2;
+ while i < len && bytes[i] != b'\n' {
+ i += 1;
+ }
+ }
+ _ => {
+ if sql[i..].len() >= 11 && sql[i..i +
11].eq_ignore_ascii_case("PARTITIONED") {
Review Comment:
This loop advances `i` one byte at a time, but then slices `sql[i..]` and
`sql[i..i + 11]`. As soon as the input contains a non-ASCII character outside
`...` or `--` comments, `i` can land in the middle of a UTF-8 codepoint and
panic before we return a `DataFusionError`. That makes this a blocker for me.
##########
crates/integrations/datafusion/src/sql_handler.rs:
##########
@@ -92,8 +91,9 @@ impl PaimonSqlHandler {
/// Execute a SQL statement. ALTER TABLE is handled by Paimon directly;
/// everything else is delegated to DataFusion.
pub async fn sql(&self, sql: &str) -> DFResult<DataFrame> {
+ let (rewritten_sql, partition_keys) = extract_partition_by(sql)?;
Review Comment:
Running `extract_partition_by` before parsing every statement changes
behavior beyond `CREATE TABLE`. The scanner currently skips only single-quoted
strings and `--` comments, so `PARTITIONED BY` inside `/* ... */` comments or
quoted identifiers can be misdetected and turn otherwise valid SQL into
`Expected ( after PARTITIONED BY`. I think this needs to be restricted to
`CREATE TABLE` parsing or made fully tokenizer/AST-aware before merge.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]