davisp commented on PR #1561:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1561#issuecomment-2533229913

   
https://github.com/apache/datafusion-sqlparser-rs/compare/main...davisp:datafusion-sqlparser-rs:pd/experiment/less-token-cloning
   
   tl;dr - Roughly 28% speedup on the microbenchmarks compared to main. 18% of 
that by tweaking a few of the main token methods in the last commit.
   
   I ended up poking at this today and have managed to get about a 28% 
improvement on the benchmarks over `main`. Roughly 10% of that is checking for 
hot paths in the flame graphs and slowly converting those functions over to 
avoid cloning tokens.
   
   On a whim, I also poked at optimizing the Token::make_word method that does 
a binary search over all keywords. That appears to have saved about 400ms (of 
1.2s total originally attributed to it). I skimmed the docs for 
[phf](https://crates.io/crates/phf) but I'm not certain what this project's 
appetite is for new dependencies. It currently seems quite lean so I didn't 
pursue that just yet.
   
   Lastly, the biggest single commit was the latest that updates a few of the 
token handling methods themselves to avoid clones. This was the biggest jump at 
18% improvement for a total just around 28% (based on my very non-scientific 
measurements).
   
   Also, one thing I did different than you @alamb, I noted CI flagged your 
EOF_TOKEN, so I just made mine a static since it doesn't have any 
initialization.
   
   All in all, this approach certainly seems feasible as well as actually 
productive. It turns out the hardest part is accounting for error reporting 
with the `expected("message", token)` API. I ended up adding an 
`expected_current` which doesn't take a token parameter and uses the current 
index for reporting. That works for now, but going forward I think we're gonna 
want to figure out a much different error reporting API as its quite fragile 
(you have to pay attention to which token is being reported and whether or not 
the token index has been modified). There are also a number of places where the 
token being reported in errors is incorrect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to