Re: [PR] POC to show performance improvements of not copying token [datafusion-sqlparser-rs]

via GitHub Tue, 10 Dec 2024 15:42:10 -0800


davisp commented on PR #1561:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1561#issuecomment-2533229913

https://github.com/apache/datafusion-sqlparser-rs/compare/main...davisp:datafusion-sqlparser-rs:pd/experiment/less-token-cloning

tl;dr - Roughly 28% speedup on the microbenchmarks compared to main. 18% of
that by tweaking a few of the main token methods in the last commit.

I ended up poking at this today and have managed to get about a 28%
improvement on the benchmarks over `main`. Roughly 10% of that is checking for
hot paths in the flame graphs and slowly converting those functions over to
avoid cloning tokens.

On a whim, I also poked at optimizing the Token::make_word method that does
a binary search over all keywords. That appears to have saved about 400ms (of
1.2s total originally attributed to it). I skimmed the docs for
[phf](https://crates.io/crates/phf) but I'm not certain what this project's
appetite is for new dependencies. It currently seems quite lean so I didn't
pursue that just yet.

Lastly, the biggest single commit was the latest that updates a few of the
token handling methods themselves to avoid clones. This was the biggest jump at
18% improvement for a total just around 28% (based on my very non-scientific
measurements).

Also, one thing I did different than you @alamb, I noted CI flagged your
EOF_TOKEN, so I just made mine a static since it doesn't have any
initialization.

All in all, this approach certainly seems feasible as well as actually
productive. It turns out the hardest part is accounting for error reporting
with the `expected("message", token)` API. I ended up adding an
`expected_current` which doesn't take a token parameter and uses the current
index for reporting. That works for now, but going forward I think we're gonna
want to figure out a much different error reporting API as its quite fragile
(you have to pay attention to which token is being reported and whether or not
the token index has been modified). There are also a number of places where the
token being reported in errors is incorrect.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] POC to show performance improvements of not copying token [datafusion-sqlparser-rs]

Reply via email to