LucaCappelletti94 commented on code in PR #2077:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/2077#discussion_r2560649462


##########
src/tokenizer.rs:
##########
@@ -896,14 +929,37 @@ impl<'a> Tokenizer<'a> {
             line: 1,
             col: 1,
         };
+        let mut prev_keyword = None;
+        let mut cs_handler = CopyStdinHandler::default();
 
         let mut location = state.location();
-        while let Some(token) = self.next_token(&mut state, buf.last().map(|t| &t.token))? {
-            let span = location.span_to(state.location());
+        while let Some(token) = self.next_token(
+            &mut location,
+            &mut state,
+            buf.last().map(|t| &t.token),
+            prev_keyword,
+            false,
+        )? {
+            if let Token::Word(Word { keyword, .. }) = &token {
+                if *keyword != Keyword::NoKeyword {
+                    prev_keyword = Some(*keyword);
+                }
+            }
 
+            let span = location.span_to(state.location());
+            cs_handler.update(&token);
             buf.push(TokenWithSpan { token, span });
-
             location = state.location();
+
+            if cs_handler.is_in_copy_from_stdin() {

Review Comment:
   Basically, if one or more None values (or empty strings) were inserted, they would (if I am not mistaken) be represented as empty strings, i.e. as `\t{empty string}\t`. If the tokenizer then strips the subsequent tabs, that row of the data is no longer reconstructible with the approach you described.
   
   Suppose you are parsing:
   
   ```SQL
   COPY public.actor (actor_id, first_name, last_name, last_update, value) FROM stdin;
   1    PENELOPE        GUINESS 2006-02-15 09:34:33 0.11111
   2    NICK                    2006-02-15 09:34:33 0.22222
   3    ED      CHASE   2006-02-15 09:34:33 0.312323
   4    JENNIFER        DAVIS   2006-02-15 09:34:33 0.3232
   \.
   ```
   
   At the entry with ID 2, a value is left blank to represent an optional entry, or analogously an empty string, which would still be a valid value. The tokenizer should preserve these tabs to avoid losing that information: once a tab is stripped, the row cannot be reconstructed.
   
   That being said, I personally find this syntax cursed. It is light years away from proper use of SQL lol.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

