Re: [PR] Remove Whitespace Tokens from Parser [datafusion-sqlparser-rs]

via GitHub Fri, 14 Nov 2025 06:33:49 -0800


iffyio commented on code in PR #2077:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/2077#discussion_r2527426005



##########
src/ast/mod.rs:
##########
@@ -4579,19 +4574,97 @@ impl fmt::Display for Statement {
                 if !legacy_options.is_empty() {
                     write!(f, " {}", display_separated(legacy_options, " "))?;
                 }
+
+                let mut null_symbol = "\\N";
+                let mut delimiter = '\t';
+                let mut quote = '"';
+                let mut escape = '\\';
+
+                // Apply options
+                for option in options {
+                    match option {
+                        CopyOption::Delimiter(c) => {
+                            delimiter = *c;
+                        }
+                        CopyOption::Quote(c) => {
+                            quote = *c;
+                        }
+                        CopyOption::Escape(c) => {
+                            escape = *c;
+                        }
+                        CopyOption::Null(null) => {
+                            null_symbol = null;
+                        }
+                        _ => {}

Review Comment:
   Can we make the match exhaustive instead of using wildcard to catch-all?
   i.e.
   ```rust
   CopyOption::Encoding(_) | CopyOption::ForceNull(_) ... => {}
   ```
   so that we're guided to revisit this code if the CopyOption enum is extended 
in the future



##########
src/ast/mod.rs:
##########
@@ -4579,19 +4574,97 @@ impl fmt::Display for Statement {
                 if !legacy_options.is_empty() {
                     write!(f, " {}", display_separated(legacy_options, " "))?;
                 }
+
+                let mut null_symbol = "\\N";
+                let mut delimiter = '\t';
+                let mut quote = '"';
+                let mut escape = '\\';
+
+                // Apply options
+                for option in options {
+                    match option {
+                        CopyOption::Delimiter(c) => {
+                            delimiter = *c;
+                        }
+                        CopyOption::Quote(c) => {
+                            quote = *c;
+                        }
+                        CopyOption::Escape(c) => {
+                            escape = *c;
+                        }
+                        CopyOption::Null(null) => {
+                            null_symbol = null;
+                        }
+                        _ => {}
+                    }
+                }
+
+                // Apply legacy options
+                for option in legacy_options {
+                    match option {
+                        CopyLegacyOption::Delimiter(c) => {
+                            delimiter = *c;
+                        }
+                        CopyLegacyOption::Null(null) => {
+                            null_symbol = null;
+                        }
+                        CopyLegacyOption::Csv(csv_options) => {
+                            for csv_option in csv_options {
+                                match csv_option {
+                                    CopyLegacyCsvOption::Quote(c) => {
+                                        quote = *c;
+                                    }
+                                    CopyLegacyCsvOption::Escape(c) => {
+                                        escape = *c;
+                                    }
+                                    _ => {}
+                                }
+                            }
+                        }
+                        _ => {}

Review Comment:
   Similarly, in both wildcards we can use an exhaustive match in place.



##########
src/parser/mod.rs:
##########
@@ -9597,6 +9533,178 @@ impl<'a> Parser<'a> {
         }
     }
 
+    pub fn parse_csv_body(

Review Comment:
   ```suggestion
       fn parse_csv_body(
   ```



##########
src/parser/mod.rs:
##########
@@ -9597,6 +9533,178 @@ impl<'a> Parser<'a> {
         }
     }
 
+    pub fn parse_csv_body(
+        &mut self,
+        options: &[CopyOption],
+        legacy_options: &[CopyLegacyOption],
+    ) -> Result<Vec<Vec<Option<String>>>, ParserError> {
+        let Token::CopyFromStdin(body) = self.next_token().token else {
+            return self.expected("COPY ... FROM STDIN with CSV body", 
self.peek_token());
+        };
+
+        let mut delimiter = '\t';
+        let mut quote = '"';
+        let mut escape = '\\';
+        let mut null_symbol = "\\N";
+
+        // Apply options
+        for option in options {
+            match option {
+                CopyOption::Delimiter(c) => {
+                    delimiter = *c;
+                }
+                CopyOption::Quote(c) => {
+                    quote = *c;
+                }
+                CopyOption::Escape(c) => {
+                    escape = *c;
+                }
+                CopyOption::Null(null) => {
+                    null_symbol = null;
+                }
+                _ => {}
+            }
+        }
+
+        // Apply legacy options
+        for option in legacy_options {
+            match option {
+                CopyLegacyOption::Delimiter(c) => {
+                    delimiter = *c;
+                }
+                CopyLegacyOption::Null(null) => {
+                    null_symbol = null;
+                }
+                CopyLegacyOption::Csv(csv_options) => {
+                    for csv_option in csv_options {
+                        match csv_option {
+                            CopyLegacyCsvOption::Quote(c) => {
+                                quote = *c;
+                            }
+                            CopyLegacyCsvOption::Escape(c) => {
+                                escape = *c;
+                            }
+                            _ => {}
+                        }
+                    }
+                }
+                _ => {}
+            }
+        }

Review Comment:
   this looks like a duplicate of the logic on the display path, can we pull it 
into a function and reuse in both places?



##########
tests/sqlparser_common.rs:
##########
@@ -3589,6 +3589,7 @@ fn test_double_value() {
 
     for (input, expected) in test_cases {
         for (i, expr) in input.iter().enumerate() {
+            println!("Testing expression: {}", expr);

Review Comment:
   ```suggestion
   ```



##########
src/parser/mod.rs:
##########
@@ -9597,6 +9533,178 @@ impl<'a> Parser<'a> {
         }
     }
 
+    pub fn parse_csv_body(
+        &mut self,
+        options: &[CopyOption],
+        legacy_options: &[CopyLegacyOption],
+    ) -> Result<Vec<Vec<Option<String>>>, ParserError> {
+        let Token::CopyFromStdin(body) = self.next_token().token else {
+            return self.expected("COPY ... FROM STDIN with CSV body", 
self.peek_token());
+        };
+
+        let mut delimiter = '\t';
+        let mut quote = '"';
+        let mut escape = '\\';
+        let mut null_symbol = "\\N";
+
+        // Apply options
+        for option in options {
+            match option {
+                CopyOption::Delimiter(c) => {
+                    delimiter = *c;
+                }
+                CopyOption::Quote(c) => {
+                    quote = *c;
+                }
+                CopyOption::Escape(c) => {
+                    escape = *c;
+                }
+                CopyOption::Null(null) => {
+                    null_symbol = null;
+                }
+                _ => {}
+            }
+        }
+
+        // Apply legacy options
+        for option in legacy_options {
+            match option {
+                CopyLegacyOption::Delimiter(c) => {
+                    delimiter = *c;
+                }
+                CopyLegacyOption::Null(null) => {
+                    null_symbol = null;
+                }
+                CopyLegacyOption::Csv(csv_options) => {
+                    for csv_option in csv_options {
+                        match csv_option {
+                            CopyLegacyCsvOption::Quote(c) => {
+                                quote = *c;
+                            }
+                            CopyLegacyCsvOption::Escape(c) => {
+                                escape = *c;
+                            }
+                            _ => {}
+                        }
+                    }
+                }
+                _ => {}
+            }
+        }
+
+        // Simple CSV parser

Review Comment:
   thinking instead of having a full csv parser it would probably be 
better/simpler to come up with a way that lets us just grab everything  up 
until the terminal tokens `\.` - looks like that was how it was handled 
previously



##########
src/dialect/mod.rs:
##########
@@ -178,6 +178,16 @@ pub trait Dialect: Debug + Any {
     /// Determine if a character is a valid unquoted identifier character
     fn is_identifier_part(&self, ch: char) -> bool;
 
+    /// Returns whether the dialect supports hyphenated identifiers

Review Comment:
   Can we elaborate here what an hyphenated identifier means? (preferrably with 
an example)



##########
src/tokenizer.rs:
##########
@@ -896,14 +929,37 @@ impl<'a> Tokenizer<'a> {
             line: 1,
             col: 1,
         };
+        let mut prev_keyword = None;
+        let mut cs_handler = CopyStdinHandler::default();
 
         let mut location = state.location();
-        while let Some(token) = self.next_token(&mut state, buf.last().map(|t| 
&t.token))? {
-            let span = location.span_to(state.location());
+        while let Some(token) = self.next_token(
+            &mut location,
+            &mut state,
+            buf.last().map(|t| &t.token),
+            prev_keyword,
+            false,
+        )? {
+            if let Token::Word(Word { keyword, .. }) = &token {
+                if *keyword != Keyword::NoKeyword {
+                    prev_keyword = Some(*keyword);
+                }
+            }
 
+            let span = location.span_to(state.location());
+            cs_handler.update(&token);
             buf.push(TokenWithSpan { token, span });
-
             location = state.location();
+
+            if cs_handler.is_in_copy_from_stdin() {

Review Comment:
   I'm thinking this solution with the `CopyStdinHandler` isn't ideal because 
we're effectively introducing a parser within the tokenizer.
   
   One thought related to the comment on the 'simple csv parser' - if it would 
work to have the parser when looking to parse a csv string blindly consume all 
tokens until the terminal (new lines can be detected by inspecting the token's 
span), I imagine the tokenizer might not require any supporting logic



##########
src/dialect/mod.rs:
##########
@@ -178,6 +178,16 @@ pub trait Dialect: Debug + Any {
     /// Determine if a character is a valid unquoted identifier character
     fn is_identifier_part(&self, ch: char) -> bool;
 
+    /// Returns whether the dialect supports hyphenated identifiers
+    fn supports_hyphenated_identifiers(&self) -> bool {
+        false
+    }
+
+    /// Returns whether the dialect supports path-like identifiers

Review Comment:
   Similarly, I think we would need to explain what path-like identifier means



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Remove Whitespace Tokens from Parser [datafusion-sqlparser-rs]

Reply via email to