[I] Improve performance by not copying `Token`s as much [datafusion-sqlparser-rs]

via GitHub Tue, 26 Nov 2024 07:12:32 -0800


alamb opened a new issue, #1558:
URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1558


   Part of https://github.com/apache/datafusion-sqlparser-rs/issues/1557
   
   While looking at the flamegraph posted to that ticket, one obvious source of 
improvement is to stop copying each token so much.
   
   Functions like `peek_token()`, and `expect_token()` copy the `Token` (which 
often includes a string)
   
   ```rust
   impl Parser {
       // returns an OWNED TokenWithLocation
       pub fn peek_token(&self) -> TokenWithLocation {
      ...
       }
   }
   ```
   
   For example
   
https://github.com/apache/datafusion-sqlparser-rs/blob/2e90e105a74bf9f50f2bad6c22992759ddb06880/src/parser/mod.rs#L3314-L3334
   
   In the above code, the `non_whitespace.cloned()` call actually copies the 
token (and string) even when many callsites only need to check what it is and 
could get by with a &
   
   ## Suggestions for improvements
   
   The biggest suggestion is to stop copying each token unless it actually 
needed a clone. For example instead of this
   
   ```rust
   impl Parser {
       // returns an OWNED TokenWithLocation
       pub fn peek_token(&self) -> TokenWithLocation {
      ...
       }
   }
   ```
   
   Make it like this
   
   ```rust
   impl Parser {
       // returns an reference to TokenWithLocation
       // (the & means no copying!)
       pub fn peek_token_ref(&self) -> &TokenWithLocation {
      ...
       }
   }
   ```
   
   I think we could do this without massive breaking changes by adding new 
functions like `peek_token_ref()` that returned a reference to the token rather 
than a clone
   
   ## Ideas
   
   I played around with it a bit and here is what I came up with. I think a 
similar pattern could be applied to other places
   
   
   ```rust
   impl Parser {
   ...
       /// Return the first non-whitespace token that has not yet been processed
       /// (or None if reached end-of-file) and mark it as processed. OK to call
       /// repeatedly after reaching EOF.
       pub fn next_token(&mut self) -> TokenWithLocation {
           self.next_token_ref().clone()
       }
   
       /// Return the first non-whitespace token that has not yet been processed
       /// (or None if reached end-of-file) and mark it as processed. OK to call
       /// repeatedly after reaching EOF.
       pub fn next_token_ref(&mut self) -> &TokenWithLocation {
           loop {
               self.index += 1;
               // skip whitespace
               if let Some(TokenWithLocation {
                   token: Token::Whitespace(_),
                   location: _,
               }) = self.tokens.get(self.index - 1) {
                   continue
               }
               break;
           }
           if (self.index - 1) < self.tokens.len() {
               &self.tokens[self.index - 1]
           }
           else {
               eof_token()
           }
       }
   ...
   }
   
   static EOF_TOKEN: OnceLock<TokenWithLocation> = OnceLock::new();
   fn eof_token() -> &'static TokenWithLocation {
       EOF_TOKEN.get_or_init(|| {
           TokenWithLocation::wrap(Token::EOF)
       })
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Improve performance by not copying `Token`s as much [datafusion-sqlparser-rs]

Reply via email to