ding-young opened a new issue, #16938: URL: https://github.com/apache/datafusion/issues/16938
### Is your feature request related to a problem or challenge?

[PR#16814](https://github.com/apache/datafusion/pull/16814) adds a new benchmark utility that retrieves memory statistics and prints a summary table. We can run the binary directly with `cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet --query 1`. However, there is still no integration with `bench.sh` to easily run individual benchmarks through `mem_profile`, nor is there a utility to compare results across different branches.

### Describe the solution you'd like

### Side Note

The way `mem_profile` collects metrics and prints them is quite different from the other existing benchmark utilities. For memory profiling, `mem_profile` spawns a new subprocess for each query execution. As a result, it does not generate a single `output.json` file for all bench queries like the other benchmarks do; instead, it prints a summary table to stdout. To compare results across branches, we should either capture this stdout or modify `mem_profile.rs` to also write results to a JSON file or another structured format.

### Steps

1. Go through `bench.sh` and update the places where it uses an outdated entrypoint, e.g. replace `--bin tpch` with `dfbench -- tpch`. (`mem_profile` passes the subcommand and args to `dfbench`, so this makes it easier to integrate.)
2. Add support for a memory profiling mode in `bench.sh`: modify `bench.sh` so that setting `MEM_PROFILE=true` runs each benchmark through the `mem_profile` binary instead of `dfbench` directly.
3. Extend `compare.py` and `mem_profile.rs` to allow side-by-side comparison of memory usage across branches or runs.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
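
To make step 3 more concrete, here is a minimal Python sketch of what a side-by-side comparison could look like. It assumes `mem_profile.rs` has been extended to write one JSON document per run. The schema used here (a `queries` list with `query` and `peak_rss_bytes` fields) and the function names `compare_peak_rss` and `format_table` are hypothetical and not part of `mem_profile` or `compare.py` today. The numbers in the demo are made up for illustration and are not measured results.

```python
# Hypothetical per-run result shape (an assumption; mem_profile does not
# write this today):
#   {"queries": [{"query": "1", "peak_rss_bytes": 209715200}, ...]}


def compare_peak_rss(baseline, candidate):
    """Return (query, baseline_bytes, candidate_bytes, pct_change) rows.

    Only queries present in both runs are compared; the order follows
    the baseline run.
    """
    base = {q["query"]: q["peak_rss_bytes"] for q in baseline["queries"]}
    cand = {q["query"]: q["peak_rss_bytes"] for q in candidate["queries"]}
    rows = []
    for query, b in base.items():
        if query not in cand:
            continue  # query missing from the candidate run
        c = cand[query]
        # Guard against a zero baseline to avoid dividing by zero.
        pct = (c - b) / b * 100.0 if b else float("nan")
        rows.append((query, b, c, pct))
    return rows


def format_table(rows):
    """Render the comparison rows as a plain-text table, sizes in MiB."""
    mib = 2**20
    lines = [
        f"{'Query':<8}{'Baseline (MiB)':>16}{'Candidate (MiB)':>17}{'Change':>10}"
    ]
    for query, b, c, pct in rows:
        lines.append(
            f"{query:<8}{b / mib:>16.1f}{c / mib:>17.1f}{pct:>+9.1f}%"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values purely for illustration.
    main_run = {"queries": [{"query": "1", "peak_rss_bytes": 200 * 2**20}]}
    branch_run = {"queries": [{"query": "1", "peak_rss_bytes": 180 * 2**20}]}
    print(format_table(compare_peak_rss(main_run, branch_run)))
```

In practice the two dictionaries would be loaded with `json.load` from files written by each branch's run. If capturing stdout is preferred over a JSON file, the same comparison would apply after parsing the summary table into this shape.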