andygrove opened a new pull request, #3752:
URL: https://github.com/apache/datafusion-comet/pull/3752

   ## Which issue does this PR close?
   
   N/A - new tooling
   
   ## Rationale for this change
   
   The existing shuffle benchmarks use small synthetic data (8192 rows x 10 
batches) with Criterion, which makes it difficult to:
   - Benchmark with realistic data distributions from TPC-H/TPC-DS at scale
   - Profile with tools like `cargo flamegraph`, `perf`, or `instruments` 
(Criterion's iteration harness interferes with profiling)
   - Test both write and read paths (current benchmarks are write-only)
   - Explore different scenarios like spilling, high partition counts, or codec 
comparisons
   
   ## What changes are included in this PR?
   
   Adds a `shuffle_bench` binary (`native/core/src/bin/shuffle_bench.rs`) that 
benchmarks Comet shuffle write and read performance independently from Spark.
   
   Features:
   - **Parquet input**: Point at TPC-H/TPC-DS Parquet files for realistic data 
distributions
   - **Synthetic data generation**: Configurable schema with int, string, 
decimal, and date columns
   - **Write + read benchmarking**: `--read-back` decodes all partitions and 
reports throughput
   - **Configurable scenarios**: partitioning (hash/single/round-robin), 
partition count, compression (none/lz4/zstd/snappy), memory limit for spilling
   - **Profiler-friendly**: Single long-running process with warmup and 
iteration support
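
   The warmup/iteration flow behind the profiler-friendly design can be sketched as a plain `std`-only loop (a minimal illustration; `run_bench` and its signature are hypothetical, not the PR's actual code):
   ```rust
   use std::time::{Duration, Instant};

   /// Run `warmup` untimed passes, then `iterations` timed passes of `work`,
   /// returning the per-iteration wall-clock durations.
   fn run_bench<F: FnMut()>(warmup: usize, iterations: usize, mut work: F) -> Vec<Duration> {
       for _ in 0..warmup {
           work(); // warm caches and allocator; results discarded
       }
       let mut timings = Vec::with_capacity(iterations);
       for _ in 0..iterations {
           let start = Instant::now();
           work();
           timings.push(start.elapsed());
       }
       timings
   }

   fn main() {
       let mut calls = 0u32;
       let timings = run_bench(1, 3, || calls += 1);
       // 1 warmup pass + 3 timed passes => 4 total calls, 3 reported timings
       println!("calls={} timed={}", calls, timings.len());
   }
   ```
   Keeping the whole run in one long-lived process like this is what lets external profilers (`perf`, `cargo flamegraph`) attach and sample meaningfully, unlike Criterion's many short measurement runs.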
   
   Example usage:
   ```sh
   # Benchmark with TPC-H data
   cargo run --release --bin shuffle_bench -- \
     --input /data/tpch-sf100/lineitem/ \
     --partitions 200 --codec zstd --read-back
   
   # Generate synthetic data
   cargo run --release --bin shuffle_bench -- \
     --generate --gen-rows 10000000 \
     --partitions 200 --codec lz4 --read-back --iterations 3 --warmup 1
   
   # Profile with flamegraph
   cargo flamegraph --release --bin shuffle_bench -- \
     --input /data/lineitem/ --partitions 200 --codec zstd
   ```
   
   ## How are these changes tested?
   
   Manually tested with generated data across various configurations:
   - Different codecs (none, lz4, zstd, snappy)
   - Read-back verification (all rows decoded correctly)
   - Multiple iterations with warmup
   - Clippy clean, cargo fmt applied


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

