ding-young opened a new issue, #16938:
URL: https://github.com/apache/datafusion/issues/16938

   ### Is your feature request related to a problem or challenge?
   
   [PR#16814](https://github.com/apache/datafusion/pull/16814) adds a new 
benchmark utility to retrieve memory statistics and print a summary table. 
   
   We can run the binary directly with `cargo run --profile release-nonlto 
--bin mem_profile -- --bench-profile release-nonlto tpch --path 
benchmarks/data/tpch_sf1 --partitions 4 --format parquet --query 1`. However, 
there is still no integration with `bench.sh` to easily run individual 
benchmarks through mem_profile, nor is there a utility to compare results 
across different branches.
   
   ### Describe the solution you'd like
   
   ### Side Note
   The way `mem_profile` collects the metrics and prints them out is quite 
different from the other existing benchmark utilities. 
   For memory profiling, `mem_profile` spawns a new subprocess for each query 
execution. As a result, it does not generate a single output.json file for all 
bench queries like other benchmarks, but instead prints a summary table to 
stdout. To compare results across branches, we should either capture this 
stdout, or modify `mem_profile.rs` to also write results to a JSON file or 
other structured format.
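
As an illustration of the first option (capturing stdout), here is a minimal Python sketch. The helper name `capture_stdout` and the idea of one output file per branch are assumptions made for illustration; nothing like this exists in the repository today. The command list is the invocation quoted above, split into arguments.

```python
import subprocess
from pathlib import Path

# The mem_profile invocation quoted in this issue, split into arguments.
MEM_PROFILE_CMD = [
    "cargo", "run", "--profile", "release-nonlto", "--bin", "mem_profile", "--",
    "--bench-profile", "release-nonlto", "tpch",
    "--path", "benchmarks/data/tpch_sf1",
    "--partitions", "4", "--format", "parquet", "--query", "1",
]


def capture_stdout(cmd: list[str], out_file: Path) -> str:
    """Run `cmd`, save its stdout verbatim to `out_file`, and return it.

    Hypothetical helper: it stores the summary table as plain text and does
    not try to parse it, since the table layout is not specified here.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(result.stdout)
    return result.stdout
```

Calling `capture_stdout(MEM_PROFILE_CMD, Path("results/main/tpch_q1.txt"))` on each branch would leave one text file per branch to diff (the `results/...` path is only an example). Writing JSON from `mem_profile.rs` directly would avoid having to parse the table later.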
   
   ### Steps 
   1. Go through `bench.sh` and update the places where it uses an outdated entrypoint. 
   e.g. replace `--bin tpch` with `dfbench -- tpch`  
   (`mem_profile` passes subcommand and args to `dfbench`, so it would be 
easier to integrate it) 
   2. Add support for memory profiling mode in `bench.sh`  
   Modify `bench.sh` so that setting `MEM_PROFILE=true` runs each benchmark through 
the `mem_profile` binary instead of `dfbench` directly.
   3. Extend `compare.py` and `mem_profile.rs` to allow side-by-side comparison of 
memory usage across branches or runs
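
For step 3, the comparison itself could look roughly like the following Python sketch. It assumes each run has already been reduced to a mapping from query name to peak memory in bytes; that mapping, and the function names `compare_peak_memory` and `format_table`, are hypothetical and not part of the existing `compare.py` or `mem_profile.rs`.

```python
def compare_peak_memory(
    baseline: dict[str, int], candidate: dict[str, int]
) -> list[tuple[str, int, int, float]]:
    """Return (query, baseline_bytes, candidate_bytes, change_pct) rows.

    Only queries present in both runs are compared. The input shape
    (query name -> peak bytes) is an assumption, not an existing format.
    """
    rows = []
    for query in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[query], candidate[query]
        # Guard against a zero baseline to avoid division by zero.
        change_pct = (cand - base) / base * 100.0 if base else float("nan")
        rows.append((query, base, cand, change_pct))
    return rows


def format_table(rows: list[tuple[str, int, int, float]]) -> str:
    """Render the comparison rows as a fixed-width text table."""
    lines = [f"{'query':<8}{'baseline':>14}{'candidate':>14}{'change':>10}"]
    for query, base, cand, pct in rows:
        lines.append(f"{query:<8}{base:>14}{cand:>14}{pct:>+9.1f}%")
    return "\n".join(lines)
```

A positive change means the candidate branch used more peak memory than the baseline for that query.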
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


