andygrove opened a new pull request, #3752:
URL: https://github.com/apache/datafusion-comet/pull/3752
## Which issue does this PR close?
N/A - new tooling
## Rationale for this change
The existing shuffle benchmarks use small synthetic data (8192 rows x 10
batches) with Criterion, which makes it difficult to:
- Benchmark with realistic data distributions from TPC-H/TPC-DS at scale
- Profile with tools like `cargo flamegraph`, `perf`, or `instruments`
(Criterion's harness interferes)
- Test both write and read paths (current benchmarks are write-only)
- Explore different scenarios like spilling, high partition counts, or codec
comparisons
## What changes are included in this PR?
Adds a `shuffle_bench` binary (`native/core/src/bin/shuffle_bench.rs`) that
benchmarks Comet shuffle write and read performance independently of Spark.
Features:
- **Parquet input**: Point at TPC-H/TPC-DS Parquet files for realistic data
distributions
- **Synthetic data generation**: Configurable schema with int, string,
decimal, and date columns
- **Write + read benchmarking**: `--read-back` decodes all partitions and
reports throughput
- **Configurable scenarios**: partitioning (hash/single/round-robin),
partition count, compression (none/lz4/zstd/snappy), memory limit for spilling
- **Profiler-friendly**: Single long-running process with warmup and
iteration support
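
As an illustration of the warmup and iteration pattern mentioned above, here is a minimal sketch of a timing loop that discards warmup runs and reports rows per second for each measured run. It is not the code from `shuffle_bench.rs`: the names `run_timed` and `rows_per_sec` are hypothetical, and the summing closure is only a stand-in for a real shuffle write.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `work` `warmup` times without recording anything,
/// then `iterations` more times, returning the wall-clock time of each measured run.
fn run_timed<F: FnMut()>(warmup: usize, iterations: usize, mut work: F) -> Vec<Duration> {
    // Warmup runs let caches and allocators settle before measurement starts.
    for _ in 0..warmup {
        work();
    }
    let mut timings = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        work();
        timings.push(start.elapsed());
    }
    timings
}

/// Hypothetical helper: throughput in rows per second for one measured run.
fn rows_per_sec(rows: u64, elapsed: Duration) -> f64 {
    rows as f64 / elapsed.as_secs_f64()
}

fn main() {
    let rows: u64 = 1_000_000;
    // Stand-in workload (an assumption for this sketch): sum a range instead of
    // writing shuffle data. `black_box` keeps the compiler from removing the loop.
    let timings = run_timed(1, 3, || {
        let sum: u64 = (0..rows).sum();
        std::hint::black_box(sum);
    });
    for (i, t) in timings.iter().enumerate() {
        println!("iteration {}: {:?} ({:.0} rows/s)", i + 1, t, rows_per_sec(rows, *t));
    }
}
```

Keeping everything in one process like this, rather than under a benchmark harness, is what makes the run straightforward to attach a profiler to.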
Example usage:
```sh
# Benchmark with TPC-H data
cargo run --release --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec zstd --read-back

# Generate synthetic data
cargo run --release --bin shuffle_bench -- \
  --generate --gen-rows 10000000 \
  --partitions 200 --codec lz4 --read-back --iterations 3 --warmup 1

# Profile with flamegraph
cargo flamegraph --release --bin shuffle_bench -- \
  --input /data/lineitem/ --partitions 200 --codec zstd
```
## How are these changes tested?
Manually tested with generated data across various configurations:
- Different codecs (none, lz4, zstd, snappy)
- Read-back verification (all rows decoded correctly)
- Multiple iterations with warmup
- Clippy clean, cargo fmt applied