andygrove opened a new pull request, #1646:
URL: https://github.com/apache/datafusion-ballista/pull/1646

   # Which issue does this PR close?
   
   <!-- We generally require a GitHub issue to be filed for all bug fixes and 
enhancements -->
   
   N/A
   
   # Rationale for this change
   
   The docker-compose benchmark setup had several latent bugs (broken `nc` 
healthchecks against `ubuntu:24.04`, `depends_on` without health gating, 
missing `tpch-gen.sh` referenced by `dev/integration-tests.sh`, obsolete 
compose `version` key, mismatched image names vs 
`dev/build-ballista-docker.sh`) and was not a useful benchmark target out of 
the box (TBL format, one iteration, default `--partitions 2`). It also wasn't 
reachable from CI.
   
   This PR makes the stack runnable end-to-end again, brings the defaults to a 
sensible benchmark shape, and adds an on-demand GitHub workflow so anyone can 
run it against `main` or a PR/branch.
   
   # What changes are included in this PR?
   
   `docker-compose.yml`
   - Drop obsolete `version: '3.3'`.
   - Align `image:` tags with what `dev/build-ballista-docker.sh` produces 
(`apache/datafusion-ballista-{scheduler,executor,benchmarks}:latest`) so 
prebuilt images get reused.
   - Replace broken `nc -z` healthchecks with a pure-bash TCP probe 
(`bash -c '</dev/tcp/127.0.0.1/PORT'`), so no extra packages are required 
in the image.
   - Switch `depends_on` to the long form with `condition: service_healthy`; 
remove the now-unneeded `--scheduler-connect-timeout-seconds 15`.
   - Executor: `--concurrent-tasks 8`, `--memory-pool-size 8GB`, `--work-dir 
/work` (tmpfs), `cpus: '8'` and `memory: 12G` per replica, 2 replicas.
   - Drop the misleading `50051:50051` publish on `ballista-client` (it just 
sleeps).
   - Mount `benchmarks/queries` and `benchmarks/run.sh` into the client 
read-only for fast iteration.
   - Consistent `restart: "no"` for benchmark runs.
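
   Taken together, the healthcheck and dependency changes look roughly like 
this (a minimal sketch, not the committed diff; the scheduler port and the 
interval/retry values are illustrative assumptions):

   ```yaml
   services:
     ballista-scheduler:
       image: apache/datafusion-ballista-scheduler:latest
       healthcheck:
         # Pure-bash TCP probe; works on stock ubuntu:24.04 with no netcat.
         test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/127.0.0.1/50050"]
         interval: 2s
         timeout: 2s
         retries: 15

     ballista-executor:
       image: apache/datafusion-ballista-executor:latest
       # Long-form depends_on: wait until the scheduler is healthy,
       # not merely started.
       depends_on:
         ballista-scheduler:
           condition: service_healthy
       deploy:
         replicas: 2
   ```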
   
   `benchmarks/tpch-gen.sh` (new)
   - Generates partitioned Parquet via 
[`tpchgen-cli`](https://crates.io/crates/tpchgen-cli) (auto-installed via 
`cargo install` if missing).
   - Defaults: `SCALE_FACTOR=10`, `PARTITIONS=16`, 
`OUTPUT_DIR=benchmarks/data`. All env-overridable.
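
   In shape, the generator script looks like the sketch below. The 
`tpchgen-cli` flag names are assumptions about its interface and may differ 
between versions; the env-default pattern is the point:

   ```shell
   #!/usr/bin/env bash
   set -euo pipefail

   # Env-overridable defaults, matching the values above.
   SCALE_FACTOR="${SCALE_FACTOR:-10}"
   PARTITIONS="${PARTITIONS:-16}"
   OUTPUT_DIR="${OUTPUT_DIR:-benchmarks/data}"

   # Install the generator on demand.
   if ! command -v tpchgen-cli >/dev/null 2>&1; then
     cargo install tpchgen-cli
   fi

   mkdir -p "$OUTPUT_DIR"

   # Flag names below are illustrative assumptions, not verified CLI syntax.
   tpchgen-cli \
     --scale-factor "$SCALE_FACTOR" \
     --parts "$PARTITIONS" \
     --format parquet \
     --output-dir "$OUTPUT_DIR"
   ```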
   
   `benchmarks/run.sh`
   - `--format parquet --partitions 16 --iterations 3`, env-overridable.
   - Drop the legacy `--format tbl` + `--expected /data` correctness branch 
(CI's `verify-benchmark-results` job already covers correctness).
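
   The resulting defaults, each overridable from the environment, can be 
sketched as follows (the benchmark binary name and its remaining flags are 
hypothetical stand-ins, not the exact committed invocation):

   ```shell
   #!/usr/bin/env bash
   set -euo pipefail

   # New defaults; export any of these to override.
   FORMAT="${FORMAT:-parquet}"
   PARTITIONS="${PARTITIONS:-16}"
   ITERATIONS="${ITERATIONS:-3}"

   # Hypothetical invocation: the real script forwards these values
   # to the Ballista TPC-H benchmark binary.
   /tpch benchmark ballista \
     --format "$FORMAT" \
     --partitions "$PARTITIONS" \
     --iterations "$ITERATIONS" \
     --path /data
   ```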
   
   `dev/integration-tests.sh`
   - Use `docker compose` v2.
   - `trap`-based teardown so `down` always runs.
   - Wait for healthy services with `--wait`.
   - Forward `SCALE_FACTOR` / `PARTITIONS` / `ITERATIONS` to gen + run.
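
   The teardown-safe driver shape is roughly this (a sketch; the 
`ballista-client` service name and script paths are assumptions taken from 
the description above):

   ```shell
   #!/usr/bin/env bash
   set -euo pipefail

   cleanup() {
     # Runs on every exit path, so `down` happens even if a step fails.
     docker compose down --volumes --remove-orphans
   }
   trap cleanup EXIT

   # Forward the benchmark shape to both generation and the run itself.
   export SCALE_FACTOR="${SCALE_FACTOR:-10}"
   export PARTITIONS="${PARTITIONS:-16}"
   export ITERATIONS="${ITERATIONS:-3}"

   ./benchmarks/tpch-gen.sh

   # --wait blocks until every service's healthcheck reports healthy.
   docker compose up --detach --wait
   docker compose run --rm ballista-client /run.sh
   ```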
   
   `.github/workflows/benchmark.yml` (new)
   - `workflow_dispatch` only — never runs automatically on PRs.
   - Inputs: `ref` (default `main`), `scale_factor` (10), `partitions` (16), 
`iterations` (3).
   - Uses an ASF-funded runs-on.com large runner (`cpu=32`, 
`family=m8a+m7a+c8a`, `image=ubuntu24-full-x64`, `tag=ballista`) when 
`repository_owner == 'apache'`; falls back to `ubuntu-latest` on forks so the 
workflow can still validate end-to-end pipeline changes.
   - Dumps compose logs on failure; 120-min timeout.
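
   In outline, the workflow has this shape (a simplified sketch: the 
runs-on.com label string and step details are illustrative; input names and 
defaults come from the list above):

   ```yaml
   name: benchmark
   on:
     workflow_dispatch:
       inputs:
         ref:
           default: main
         scale_factor:
           default: "10"
         partitions:
           default: "16"
         iterations:
           default: "3"

   jobs:
     benchmark:
       timeout-minutes: 120
       # Large runs-on.com runner inside apache/, ubuntu-latest on forks
       # (the runs-on label syntax here is illustrative).
       runs-on: ${{ github.repository_owner == 'apache' && 'runs-on=runner/cpu=32/family=m8a+m7a+c8a/image=ubuntu24-full-x64/tag=ballista' || 'ubuntu-latest' }}
       steps:
         - uses: actions/checkout@v4
           with:
             ref: ${{ inputs.ref }}
         - name: Run benchmarks
           env:
             SCALE_FACTOR: ${{ inputs.scale_factor }}
             PARTITIONS: ${{ inputs.partitions }}
             ITERATIONS: ${{ inputs.iterations }}
           run: ./dev/integration-tests.sh
         - name: Dump compose logs
           if: failure()
           run: docker compose logs
   ```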
   
   # Are there any user-facing changes?
   
   No public API changes. Benchmark/dev-tooling only. Behaviour changes for 
users running `dev/integration-tests.sh` or `docker compose up`:
   - Generated data is now Parquet under `benchmarks/data` (was TBL).
   - Default benchmark shape is SF10, 16 partitions, 3 iterations (was 
effectively SF1, 2 partitions, 1 iteration).
   - Each executor replica is bound to 8 vCPU / 12 GB / 8 GB memory pool — 
meaningful only on hosts with at least 16 vCPU available to Docker.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

