aharpervc commented on PR #1809:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1809#issuecomment-2874016995

   > But actually I think if we can come up with something that doesn't touch 
that core parser loop we might have something mergeable. I'll reopen this PR, 
feel free to iterate if you have ideas to try out and I'm happy to take a look 
when you do!
   
   Sounds good... I will brainstorm on this 👍.
   
   ---
   
   It may also be useful to clarify what is my overall objective... basically, 
I think it'd be useful read a whole file into a string, then pass it to this 
library for parsing, then process the AST's. Since SQL code I work with (and in 
the industry) often includes `GO` batch delimiters in the file source code, and 
this library doesn't support batch delimiters outside of this PR, that 
presented a difficulty. Initially I was thinking that the smoothest approach 
would be to add batch delimiter support directly into the parser. However, this 
discussion is making me consider the pro/con more closely.
   
   Something else I was thinking about is how the current "parse statements" 
API _implies_ that you're only giving the parser sql of a _single batch_. Eg, 
`single batch string` -> `Vec<Statement>` makes complete sense in this model. 
However this PR introduced a much more odd model, where a single string could 
actually be multiple batches. Eg, `Vec<Batch>` -> `Vec<Statements>` is a 
strange API.
   
   So I think there could be a couple possible approaches going forward:
   
   _Option 1: Embrace Batches_
   
   We could introduce a Batch struct & new top level `String` -> `Vec<Batch>`, 
then `Batch` -> `Vec<Statements>`. `Batch` could be like
   
   ```rust
   struct Batch {
     statements: Vec<Statement>,
     count: u16 // or whatever
   }
   ```
   
   This seems useful because it makes explicit what was formerly implicit. 
Probably all the existing top level API's could be left alone, so all existing 
users uninterested in Batches would be unaffected. Then we'd want something 
like `parse_batches` to return a `Vec<Batch>` and then do all the normal 
parsing.
   
   The downside is, it would introduce a new concept & ongoing maintenance when 
perhaps there isn't a lot of interest. Also, not every dialect even has the 
concept of a batch, I think, so that's a difficulty here too.
   
   _Option 2: Reject Batches_
   
   Close this PR permanently & open a new PR, that would basically be a 
docs/readme update, to clarify what is the expected string users should provide 
to parse_sql & similar API's. The clarification is it needs to be the contents 
of a SQL batch, since that is the thing that can clearly be turned into a 
single `Vec<Statement>`. Dialects such as SQL Server which include batch 
delimiters in the SQL source code should be processed _outside_ of this library 
to split the batches, then the batch string can be given to sqlparser-rs.
   
   The downside is back to that original objective, which is reading a whole 
SQL file to a string then passing it to this library for parsing. If library 
users have to do our own detection of the batch delimiter keyword, now we all 
have to implement a mini parser before using the actual parser. That seems 
really unfortunate.
   
   _Option 3: Hold Batches out at arm's length_ (thesis, antithesis, 
...synthesis)
   
   I don't have a clear vision on this option yet, but it seems like this 
library is well positioned to do _some_ work on this topic, but perhaps not as 
much as full on Batch support. What if we had something like "parse_batches", 
but it returned a simple `Vec<String>`? I'm imagining something like:
   
   ```rust
   fn split_batches -> Vec<String> {
     let mut result = vec![];
     while let Some(token) = self.tokens.next() {
       if peek_keyword(Keyword::GO) {
         self.prev_token();
         result.push(parse_go());
       }
     }
   }
   
   ```
   
   The concept here is the library has the ability to scan through the tokens & 
parse the GO keyword, and split a single string into Vec<String> but would 
otherwise remain unchanged. Dialects that support batches could use this API to 
turn a whole file string into something suitable for parse_statements. Other 
dialects would remain unaffected.
   
   ---
   
   I'd be interested to hear anyone else's opinion on this too, for/against any 
option or to brainstorm other options.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to