Piyush-08-bot commented on issue #1845: URL: https://github.com/apache/datafusion-ballista/issues/1845#issuecomment-4694753552
Thanks for confirming @milenkovicm! I checked apache/datafusion-benchmarks and found that TPC-DS is actually mostly ready there:

- All 99 query files (q1.sql-q99.sql) already exist and can be copied over directly
- There's a generic Rust runner pattern (datafusion-rust/main.rs) that registers tables and runs queries by number, similar in spirit to tpch.rs

The main gap is data generation - TPC-H uses tpchgen-cli which is a simple cargo-installable binary, but TPC-DS data generation in datafusion-benchmarks requires downloading tpc-ds-tool.zip from TPC.org manually + building dsdgen via Docker. That's heavier and harder to fully automate in CI the same way as the TPC-H workflow.

Given that, would it be reasonable to scope this PR as:

- Add the TPC-DS query files + a benchmark binary (adapted from tpch.rs pattern) to run against a Ballista cluster
- Start with a smaller subset of queries to keep things manageable
- For the CI workflow, either use a pre-generated small dataset checked into CI cache, or document a manual data-gen step for now, and we can fully automate the data-gen part in a follow-up once we figure out a lighter-weight TPC-DS generator (similar to tpchgen-cli)?

Let me know if this scope works or if you'd prefer a different approach.
