LucaCappelletti94 opened a new issue, #2215: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/2215
Hi! I've created an open-source benchmark comparing Rust SQL parsers for PostgreSQL workloads and wanted to share the results with you.

## Benchmark Results

## Methodology

- **Framework**: Criterion.rs v0.8 with flat sampling mode, 50 samples, 3-second measurement time
- **Workload**: Parsing batches of 1-1000 SQL statements concatenated into a single string
- **Datasets**:
  - SELECT: 4,505 queries from Spider (Yale) + Gretel AI
  - INSERT: 992 queries from Gretel AI
  - UPDATE: 983 queries from Gretel AI
  - DELETE: 933 queries from Gretel AI
- **Environment**: AMD Ryzen Threadripper PRO 5975WX, Ubuntu 24.04, Rust 2021 edition
- **Dialect**: All parsers configured for PostgreSQL

## Results for sqlparser-rs

**sqlparser-rs performs excellently in this benchmark:**

- **1.5-2x faster** than FFI-based parsers (pg_query.rs, pg_parse)
- **100% compatibility** with all test queries in our corpus
- Best balance of speed, correctness, and multi-dialect support

| Statement Type | 500 statements |
|----------------|----------------|
| SELECT         | 5.68 ms        |
| INSERT         | 4.90 ms        |
| UPDATE         | 3.20 ms        |
| DELETE         | 2.93 ms        |

### Observations

1. Pure Rust implementation avoids FFI overhead, showing consistent performance advantage
2. The recursive descent parser handles complex queries (CTEs, window functions, nested subqueries) efficiently
3. Fuzz testing gives confidence in robustness that other parsers lack
4. Could improve performance by using a generic S for most strings, allowing for both `String` and `&str` to reduce the amount of cloning which happens both when creating the tokens and the statements

## Full Benchmark Repository

<https://github.com/LucaCappelletti94/sql_ast_benchmark>

The repository includes:

- Complete benchmark code
- All SQL test datasets
- Reproducible methodology
- Detailed README with analysis

## Feedback Request

I'd appreciate any feedback on the benchmark methodology or if there are any improvements I should make:

1. Are there any parser configuration options that could improve performance?
2. Are there specific query patterns I should include in the test corpus?
3. Is there anything about the benchmark setup that might not represent real-world usage?

Thank you for maintaining such an excellent library!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
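
To make observation 4 (a generic `S` for strings) more concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: `Token<S>`, `tokenize`, `scan`, and `to_owned_token` are illustrative names and do not correspond to sqlparser-rs's actual types or API. The toy tokenizer only shows that, with a generic string parameter, tokens can borrow slices of the input and allocate only when an owned copy is explicitly requested.

```rust
use std::iter::Peekable;

/// Hypothetical token type, generic over its string storage `S`.
/// This is NOT sqlparser-rs's real `Token`; it only illustrates the idea.
#[derive(Debug, Clone, PartialEq)]
pub enum Token<S> {
    Word(S),
    Number(S),
    Symbol(char),
}

impl<S: AsRef<str>> Token<S> {
    /// Build an owned token; this is the only place a `String` is allocated.
    pub fn to_owned_token(&self) -> Token<String> {
        match self {
            Token::Word(s) => Token::Word(s.as_ref().to_string()),
            Token::Number(s) => Token::Number(s.as_ref().to_string()),
            Token::Symbol(c) => Token::Symbol(*c),
        }
    }
}

/// Advance `chars` while `keep` holds and return the byte offset where it stops.
fn scan<I>(chars: &mut Peekable<I>, len: usize, keep: impl Fn(char) -> bool) -> usize
where
    I: Iterator<Item = (usize, char)>,
{
    while let Some(&(i, ch)) = chars.peek() {
        if !keep(ch) {
            return i;
        }
        chars.next();
    }
    len
}

/// Toy tokenizer (assumption: words, integers, and single-char symbols only).
/// Every `Word` and `Number` borrows from `input`, so nothing is cloned here.
pub fn tokenize(input: &str) -> Vec<Token<&str>> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_alphabetic() || c == '_' {
            let end = scan(&mut chars, input.len(), |ch| ch.is_alphanumeric() || ch == '_');
            tokens.push(Token::Word(&input[start..end]));
        } else if c.is_ascii_digit() {
            let end = scan(&mut chars, input.len(), |ch| ch.is_ascii_digit());
            tokens.push(Token::Number(&input[start..end]));
        } else {
            chars.next();
            tokens.push(Token::Symbol(c));
        }
    }
    tokens
}
```

Callers that only inspect the tokens would work with `Token<&str>` directly, while callers that need the result to outlive the input string could call `to_owned_token` on the tokens they keep. Whether this pays off in the real parser would have to be measured, since a generic parameter on the AST also has ergonomic and compile-time costs.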
