This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new a01774ff6 chore: Add Docker Compose support for TPC benchmarks (#3576)
a01774ff6 is described below
commit a01774ff6949b6f19084e02336a42da1d50175e2
Author: Andy Grove <[email protected]>
AuthorDate: Tue Feb 24 09:08:25 2026 -0700
chore: Add Docker Compose support for TPC benchmarks (#3576)
* Add Docker Compose support for TPC benchmarks
Add Docker Compose setup for running TPC-H/TPC-DS benchmarks in an
isolated Spark standalone cluster with two workers. Bundle TPC query
SQL files in the repository, removing the need for external
TPCH_QUERIES/TPCDS_QUERIES environment variables. Add
Dockerfile.build-comet for cross-compiling Comet JARs with Linux
native libraries on macOS.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* Add lightweight Docker Compose file for laptop benchmarking
Add docker-compose-laptop.yml with a single worker (~12 GB total) for
SF1-SF10 testing on laptops, replacing the --scale/env var workaround.
Update README to document both compose files side by side.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* revert a docs change
* Fix undefined variable warning in Dockerfile.build-comet
Remove self-reference to LD_LIBRARY_PATH since it is not defined
earlier in the Dockerfile, causing a Docker build warning.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* fix
* prettier
* fix
* prettier
---------
Co-authored-by: Claude Opus 4.6 <[email protected]>
---
.gitignore | 2 +
benchmarks/tpc/README.md | 178 ++++++++++++++++++++-
benchmarks/tpc/infra/docker/Dockerfile | 58 +++++++
benchmarks/tpc/infra/docker/Dockerfile.build-comet | 76 +++++++++
.../tpc/infra/docker/docker-compose-laptop.yml | 97 +++++++++++
benchmarks/tpc/infra/docker/docker-compose.yml | 131 +++++++++++++++
benchmarks/tpc/run.py | 9 +-
benchmarks/tpc/tpcbench.py | 19 +--
8 files changed, 550 insertions(+), 20 deletions(-)
diff --git a/.gitignore b/.gitignore
index 05b37627b..15cac247e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,8 +17,10 @@ filtered_rat.txt
dev/dist
apache-rat-*.jar
venv
+.venv
dev/release/comet-rm/workdir
spark/benchmarks
.DS_Store
comet-event-trace.json
__pycache__
+output
diff --git a/benchmarks/tpc/README.md b/benchmarks/tpc/README.md
index 779ad1753..eb1fb0479 100644
--- a/benchmarks/tpc/README.md
+++ b/benchmarks/tpc/README.md
@@ -26,6 +26,10 @@ For full instructions on running these benchmarks on an EC2 instance, see the [C
[Comet Benchmarking on EC2 Guide]: https://datafusion.apache.org/comet/contributor-guide/benchmarking_aws_ec2.html
+## Setup
+
+TPC queries are bundled in `benchmarks/tpc/queries/` (derived from TPC-H/DS under the TPC Fair Use Policy).
+
## Usage
All benchmarks are run via `run.py`:
@@ -55,10 +59,9 @@ export SPARK_HOME=/opt/spark-3.5.3-bin-hadoop3/
export SPARK_MASTER=spark://yourhostname:7077
```
-Set path to queries and data:
+Set path to data (TPC queries are bundled in `benchmarks/tpc/queries/`):
```shell
-export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
export TPCH_DATA=/mnt/bigdata/tpch/sf100/
```
@@ -135,9 +138,9 @@ $SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 \
--conf spark.driver.memory=8G \
- --conf spark.executor.instances=1 \
+ --conf spark.executor.instances=2 \
--conf spark.executor.cores=8 \
- --conf spark.cores.max=8 \
+ --conf spark.cores.max=16 \
--conf spark.executor.memory=16g \
create-iceberg-tables.py \
--benchmark tpch \
@@ -166,7 +169,6 @@ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.10.0.jar
export ICEBERG_JAR=/path/to/iceberg-spark-runtime-3.5_2.12-1.8.1.jar
export ICEBERG_WAREHOUSE=/mnt/bigdata/iceberg-warehouse
-export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
sudo ./drop-caches.sh
python3 run.py --engine comet-iceberg --benchmark tpch
```
@@ -185,6 +187,172 @@ physical plan output.
| `--catalog`  | No       | `local`        | Iceberg catalog name         |
| `--database` | No       | benchmark name | Database name for the tables |
+## Running with Docker
+
+A Docker Compose setup is provided in `infra/docker/` for running benchmarks in an isolated
+Spark standalone cluster. The Docker image supports both **Linux (amd64)** and **macOS (arm64)**
+via architecture-agnostic Java symlinks created at build time.
+
+### Build the image
+
+The image must be built for the correct platform to match the native libraries in the
+engine JARs (e.g. Comet bundles `libcomet.so` for a specific OS/arch).
+
+```shell
+docker build -t comet-bench -f benchmarks/tpc/infra/docker/Dockerfile .
+```
+
+### Building a compatible Comet JAR
+
+The Comet JAR contains platform-specific native libraries (`libcomet.so` / `libcomet.dylib`).
+A JAR built on the host may not work inside the Docker container due to OS, architecture,
+or glibc version mismatches. Use `Dockerfile.build-comet` to build a JAR with compatible
+native libraries:
+
+- **macOS (Apple Silicon):** The host JAR contains `darwin/aarch64` libraries which
+ won't work in Linux containers. You **must** use the build Dockerfile.
+- **Linux:** If your host glibc version differs from the container's, the native library
+  will fail to load with a `GLIBC_x.xx not found` error. The build Dockerfile uses
+ Ubuntu 20.04 (glibc 2.31) for broad compatibility. Use it if you see
+ `UnsatisfiedLinkError` mentioning glibc when running benchmarks.
+
+```shell
+mkdir -p output
+docker build -t comet-builder \
+ -f benchmarks/tpc/infra/docker/Dockerfile.build-comet .
+docker run --rm -v $(pwd)/output:/output comet-builder
+export COMET_JAR=$(pwd)/output/comet-spark-spark3.5_2.12-*.jar
+```
+
+### Platform notes
+
+**macOS (Apple Silicon):** Docker Desktop is required.
+
+- **Memory:** Docker Desktop defaults to a small memory allocation (often 8 GB) which
+  is not enough for Spark benchmarks. Go to **Docker Desktop > Settings > Resources >
+  Memory** and increase it to at least 48 GB (each worker requests 16 GB for its executor
+  plus overhead, and the driver needs 8 GB). Without enough memory, executors will be
+ OOM-killed (exit code 137).
+- **File Sharing:** You may need to add your data directory (e.g. `/opt`) to
+  **Docker Desktop > Settings > Resources > File Sharing** before mounting host volumes.
+
+**Linux (amd64):** Docker uses cgroup memory limits directly without a VM layer. No
+special Docker configuration is needed, but you may still need to build the Comet JAR
+using `Dockerfile.build-comet` (see above) if your host glibc version doesn't match
+the container's.
+
+The Docker image auto-detects the container architecture (amd64/arm64) and sets up
+arch-agnostic Java symlinks. The compose file uses `BENCH_JAVA_HOME` (not `JAVA_HOME`)
+to avoid inheriting the host's Java path into the container.
+
+### Start the cluster
+
+Set environment variables pointing to your host paths, then start the Spark master and
+two workers:
+
+```shell
+export DATA_DIR=/mnt/bigdata/tpch/sf100
+export RESULTS_DIR=/tmp/bench-results
+export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.10.0.jar
+
+mkdir -p $RESULTS_DIR/spark-events
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+```
+
+Set `COMET_JAR`, `GLUTEN_JAR`, or `ICEBERG_JAR` to the host path of the engine JAR you
+want to use. Each JAR is mounted individually into the container, so you can easily switch
+between versions by changing the path and restarting.
+
+### Run benchmarks
+
+Use `docker compose run --rm` to execute benchmarks. The `--rm` flag removes the
+container when it exits, preventing port conflicts on subsequent runs. Pass
+`--no-restart` since the cluster is already managed by Compose, and `--output /results`
+so that output files land in the mounted results directory:
+
+```shell
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml \
+ run --rm -p 4040:4040 bench \
+ python3 /opt/benchmarks/run.py \
+ --engine comet --benchmark tpch --output /results --no-restart
+```
+
+The `-p 4040:4040` flag exposes the Spark Application UI on the host. The following
+UIs are available during a benchmark run:
+
+| UI | URL |
+| ----------------- | ---------------------- |
+| Spark Master | http://localhost:8080 |
+| Worker 1 | http://localhost:8081 |
+| Worker 2 | http://localhost:8082 |
+| Spark Application | http://localhost:4040 |
+| History Server | http://localhost:18080 |
+
+> **Note:** The Master UI links to the Application UI using the container's internal
+> hostname, which is not reachable from the host. Use `http://localhost:4040` directly
+> to access the Application UI.
+
+The Spark Application UI is only available while a benchmark is running. To inspect
+completed runs, uncomment the `history-server` service in `docker-compose.yml` and
+restart the cluster. The History Server reads event logs from `$RESULTS_DIR/spark-events`.
+
+For Gluten (requires Java 8), you must restart the **entire cluster** with `BENCH_JAVA_HOME`
+set so that all services (master, workers, and bench) use Java 8:
+
+```shell
+export BENCH_JAVA_HOME=/usr/lib/jvm/java-8-openjdk
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml down
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml \
+ run --rm bench \
+ python3 /opt/benchmarks/run.py \
+ --engine gluten --benchmark tpch --output /results --no-restart
+```
+
+> **Important:** Only passing `-e JAVA_HOME=...` to the `bench` container is not
+> sufficient -- the workers also need Java 8 or Gluten will fail at runtime with
+> `sun.misc.Unsafe` errors. Unset `BENCH_JAVA_HOME` (or switch it back to Java 17)
+> and restart the cluster before running Comet or Spark benchmarks.
+
+### Memory limits
+
+Two compose files are provided for different hardware profiles:
+
+| File                        | Workers | Total memory | Use case                       |
+| --------------------------- | ------- | ------------ | ------------------------------ |
+| `docker-compose.yml`        | 2       | ~74 GB       | SF100+ on a workstation/server |
+| `docker-compose-laptop.yml` | 1       | ~12 GB       | SF1–SF10 on a laptop           |
+
+**`docker-compose.yml`** (workstation default):
+
+| Container | Container limit (`mem_limit`) | Spark JVM allocation |
+| -------------- | ----------------------------- | ------------------------- |
+| spark-worker-1 | 32 GB | 16 GB executor + overhead |
+| spark-worker-2 | 32 GB | 16 GB executor + overhead |
+| bench (driver) | 10 GB | 8 GB driver |
+| **Total** | **74 GB** | |
+
+Configure via environment variables: `WORKER_MEM_LIMIT` (default: 32g per worker),
+`BENCH_MEM_LIMIT` (default: 10g), `WORKER_MEMORY` (default: 16g, Spark executor memory),
+`WORKER_CORES` (default: 8).
+
+### Running on a laptop with small scale factors
+
+For local development or testing with small scale factors (e.g. SF1 or SF10), use the
+laptop compose file which runs a single worker with reduced memory:
+
+```shell
+docker compose -f benchmarks/tpc/infra/docker/docker-compose-laptop.yml up -d
+```
+
+This starts one worker (4 GB executor inside an 8 GB container) and a 4 GB bench
+container, totaling approximately **12 GB** of memory.
+
+The benchmark scripts request 2 executor instances and 16 max cores by default
+(`run.py`). Spark will simply use whatever resources are available on the single worker,
+so no script changes are needed.
+
### Comparing Parquet vs Iceberg performance
Run both benchmarks and compare:
diff --git a/benchmarks/tpc/infra/docker/Dockerfile b/benchmarks/tpc/infra/docker/Dockerfile
new file mode 100644
index 000000000..60567536a
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/Dockerfile
@@ -0,0 +1,58 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Benchmark image for running TPC-H and TPC-DS benchmarks across engines
+# (Spark, Comet, Gluten).
+#
+# Build (from repository root):
+# docker build -t comet-bench -f benchmarks/tpc/infra/docker/Dockerfile .
+
+ARG SPARK_IMAGE=apache/spark:3.5.2-python3
+FROM ${SPARK_IMAGE}
+
+USER root
+
+# Install Java 8 (Gluten) and Java 17 (Comet) plus Python 3.
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends \
+ openjdk-8-jdk-headless \
+ openjdk-17-jdk-headless \
+ python3 python3-pip procps \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+# Default to Java 17 (override with JAVA_HOME at runtime for Gluten).
+# Detect architecture (amd64 or arm64) so the image works on both Linux and macOS.
+ARG TARGETARCH
+RUN ln -s /usr/lib/jvm/java-17-openjdk-${TARGETARCH} /usr/lib/jvm/java-17-openjdk && \
+ ln -s /usr/lib/jvm/java-8-openjdk-${TARGETARCH} /usr/lib/jvm/java-8-openjdk
+ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+
+# Copy the benchmark scripts into the image.
+COPY benchmarks/tpc/run.py /opt/benchmarks/run.py
+COPY benchmarks/tpc/tpcbench.py /opt/benchmarks/tpcbench.py
+COPY benchmarks/tpc/engines /opt/benchmarks/engines
+COPY benchmarks/tpc/queries /opt/benchmarks/queries
+COPY benchmarks/tpc/create-iceberg-tables.py /opt/benchmarks/create-iceberg-tables.py
+COPY benchmarks/tpc/generate-comparison.py /opt/benchmarks/generate-comparison.py
+
+# Engine JARs are bind-mounted or copied in at runtime via --jars.
+# Data and query paths are also bind-mounted.
+
+WORKDIR /opt/benchmarks
+
+# Defined in the base apache/spark image.
+ARG spark_uid
+USER ${spark_uid}
diff --git a/benchmarks/tpc/infra/docker/Dockerfile.build-comet b/benchmarks/tpc/infra/docker/Dockerfile.build-comet
new file mode 100644
index 000000000..af5a0257a
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/Dockerfile.build-comet
@@ -0,0 +1,76 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Build a Comet JAR with native libraries for the current platform.
+#
+# This is useful on macOS (Apple Silicon) where the host-built JAR contains
+# darwin/aarch64 native libraries but Docker containers need linux/aarch64.
+#
+# Usage (from repository root):
+#   docker build -t comet-builder -f benchmarks/tpc/infra/docker/Dockerfile.build-comet .
+# docker run --rm -v $(pwd)/output:/output comet-builder
+#
+# The JAR is copied to ./output/ on the host.
+
+# Use Ubuntu 20.04 to match the GLIBC version (2.31) in apache/spark images.
+FROM ubuntu:20.04 AS builder
+
+ARG TARGETARCH
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Install build dependencies: Java 17, Maven wrapper prerequisites, GCC 11.
+# Ubuntu 20.04's default GCC 9 has a memcmp bug (GCC #95189) that breaks aws-lc-sys.
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ openjdk-17-jdk-headless \
+ curl ca-certificates git pkg-config \
+ libssl-dev unzip software-properties-common \
+ && add-apt-repository -y ppa:ubuntu-toolchain-r/test \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends gcc-11 g++-11 make \
+ && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110 \
+ && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 110 \
+ && update-alternatives --install /usr/bin/cc cc /usr/bin/gcc-11 110 \
+ && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+# Install protoc 25.x (Ubuntu 20.04's protoc is too old for proto3 optional fields).
+ARG PROTOC_VERSION=25.6
+RUN ARCH=$(uname -m) && \
+ if [ "$ARCH" = "aarch64" ]; then PROTOC_ARCH="linux-aarch_64"; \
+ else PROTOC_ARCH="linux-x86_64"; fi && \
+  curl -sLO "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" && \
+  unzip -o "protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" -d /usr/local bin/protoc && \
+ rm "protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" && \
+ protoc --version
+
+# Set JAVA_HOME and LD_LIBRARY_PATH so the Rust build can find libjvm.
+RUN ln -s /usr/lib/jvm/java-17-openjdk-${TARGETARCH} /usr/lib/jvm/java-17-openjdk
+ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+ENV LD_LIBRARY_PATH=${JAVA_HOME}/lib/server
+
+# Install Rust.
+RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+ENV PATH="/root/.cargo/bin:${PATH}"
+
+WORKDIR /build
+
+# Copy the full source tree.
+COPY . .
+
+# Build native code + package the JAR (skip tests).
+RUN make release-nogit
+
+# The entrypoint copies the built JAR to /output (bind-mounted from host).
+RUN mkdir -p /output
+CMD ["sh", "-c", "cp spark/target/comet-spark-spark3.5_2.12-*-SNAPSHOT.jar /output/ && echo 'Comet JAR copied to /output/' && ls -lh /output/*.jar"]
diff --git a/benchmarks/tpc/infra/docker/docker-compose-laptop.yml b/benchmarks/tpc/infra/docker/docker-compose-laptop.yml
new file mode 100644
index 000000000..6c5d8dbaf
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/docker-compose-laptop.yml
@@ -0,0 +1,97 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Lightweight Spark standalone cluster for TPC benchmarks on a laptop.
+#
+# Single worker, ~12 GB total memory. Suitable for SF1-SF10 testing.
+#
+# Usage:
+# export COMET_JAR=/path/to/comet-spark-0.10.0.jar
+#   docker compose -f benchmarks/tpc/infra/docker/docker-compose-laptop.yml up -d
+#
+# Environment variables (set in .env or export before running):
+# BENCH_IMAGE - Docker image to use (default: comet-bench)
+# DATA_DIR - Host path to TPC data (default: /tmp/tpc-data)
+#   RESULTS_DIR     - Host path for results output (default: /tmp/bench-results)
+# COMET_JAR - Host path to Comet JAR
+# GLUTEN_JAR - Host path to Gluten JAR
+# ICEBERG_JAR - Host path to Iceberg Spark runtime JAR
+#   BENCH_JAVA_HOME - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
+# Set to /usr/lib/jvm/java-8-openjdk for Gluten
+
+x-volumes: &volumes
+ - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}:/results
+ - ${COMET_JAR:-/dev/null}:/jars/comet.jar:ro
+ - ${GLUTEN_JAR:-/dev/null}:/jars/gluten.jar:ro
+ - ${ICEBERG_JAR:-/dev/null}:/jars/iceberg.jar:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}/logs:/opt/spark/logs
+ - ${RESULTS_DIR:-/tmp/bench-results}/work:/opt/spark/work
+
+services:
+ spark-master:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-master
+ hostname: spark-master
+ command: /opt/spark/sbin/start-master.sh --host spark-master
+ ports:
+ - "7077:7077"
+ - "8080:8080"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_MASTER_HOST=spark-master
+ - SPARK_NO_DAEMONIZE=true
+
+ spark-worker-1:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-worker-1
+ hostname: spark-worker-1
+ depends_on:
+ - spark-master
+ command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
+ ports:
+ - "8081:8081"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_WORKER_CORES=4
+ - SPARK_WORKER_MEMORY=4g
+ - SPARK_NO_DAEMONIZE=true
+ mem_limit: 8g
+ memswap_limit: 8g
+
+ bench:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: bench-runner
+ depends_on:
+ - spark-master
+ - spark-worker-1
+ # Override 'command' to run a specific benchmark, e.g.:
+ # docker compose run bench python3 /opt/benchmarks/run.py \
+ # --engine comet --benchmark tpch --no-restart
+    command: ["echo", "Use 'docker compose run bench python3 /opt/benchmarks/run.py ...' to run benchmarks"]
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_HOME=/opt/spark
+ - SPARK_MASTER=spark://spark-master:7077
+ - COMET_JAR=/jars/comet.jar
+ - GLUTEN_JAR=/jars/gluten.jar
+ - ICEBERG_JAR=/jars/iceberg.jar
+ - TPCH_DATA=/data
+ - TPCDS_DATA=/data
+ mem_limit: 4g
+ memswap_limit: 4g
diff --git a/benchmarks/tpc/infra/docker/docker-compose.yml b/benchmarks/tpc/infra/docker/docker-compose.yml
new file mode 100644
index 000000000..cca8cffa1
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/docker-compose.yml
@@ -0,0 +1,131 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Spark standalone cluster for TPC benchmarks.
+#
+# Two workers are used so that shuffles go through the network stack,
+# which better reflects real cluster behavior.
+#
+# Usage:
+# export COMET_JAR=/path/to/comet-spark-0.10.0.jar
+# docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+#
+# Environment variables (set in .env or export before running):
+# BENCH_IMAGE - Docker image to use (default: comet-bench)
+# DATA_DIR - Host path to TPC data (default: /tmp/tpc-data)
+#   RESULTS_DIR     - Host path for results output (default: /tmp/bench-results)
+# COMET_JAR - Host path to Comet JAR
+# GLUTEN_JAR - Host path to Gluten JAR
+# ICEBERG_JAR - Host path to Iceberg Spark runtime JAR
+# WORKER_MEM_LIMIT - Hard memory limit per worker container (default: 32g)
+# BENCH_MEM_LIMIT - Hard memory limit for the bench runner (default: 10g)
+#   BENCH_JAVA_HOME - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
+# Set to /usr/lib/jvm/java-8-openjdk for Gluten
+
+x-volumes: &volumes
+ - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}:/results
+ - ${COMET_JAR:-/dev/null}:/jars/comet.jar:ro
+ - ${GLUTEN_JAR:-/dev/null}:/jars/gluten.jar:ro
+ - ${ICEBERG_JAR:-/dev/null}:/jars/iceberg.jar:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}/logs:/opt/spark/logs
+ - ${RESULTS_DIR:-/tmp/bench-results}/work:/opt/spark/work
+
+x-worker: &worker
+ image: ${BENCH_IMAGE:-comet-bench}
+ depends_on:
+ - spark-master
+ command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_WORKER_CORES=${WORKER_CORES:-8}
+ - SPARK_WORKER_MEMORY=${WORKER_MEMORY:-16g}
+ - SPARK_NO_DAEMONIZE=true
+ mem_limit: ${WORKER_MEM_LIMIT:-32g}
+ memswap_limit: ${WORKER_MEM_LIMIT:-32g}
+
+services:
+ spark-master:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-master
+ hostname: spark-master
+ command: /opt/spark/sbin/start-master.sh --host spark-master
+ ports:
+ - "7077:7077"
+ - "8080:8080"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_MASTER_HOST=spark-master
+ - SPARK_NO_DAEMONIZE=true
+
+ spark-worker-1:
+ <<: *worker
+ container_name: spark-worker-1
+ hostname: spark-worker-1
+ ports:
+ - "8081:8081"
+
+ spark-worker-2:
+ <<: *worker
+ container_name: spark-worker-2
+ hostname: spark-worker-2
+ ports:
+ - "8082:8081"
+
+ bench:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: bench-runner
+ depends_on:
+ - spark-master
+ - spark-worker-1
+ - spark-worker-2
+ # Override 'command' to run a specific benchmark, e.g.:
+ # docker compose run bench python3 /opt/benchmarks/run.py \
+ # --engine comet --benchmark tpch --no-restart
+    command: ["echo", "Use 'docker compose run bench python3 /opt/benchmarks/run.py ...' to run benchmarks"]
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_HOME=/opt/spark
+ - SPARK_MASTER=spark://spark-master:7077
+ - COMET_JAR=/jars/comet.jar
+ - GLUTEN_JAR=/jars/gluten.jar
+ - ICEBERG_JAR=/jars/iceberg.jar
+ - TPCH_DATA=/data
+ - TPCDS_DATA=/data
+ mem_limit: ${BENCH_MEM_LIMIT:-10g}
+ memswap_limit: ${BENCH_MEM_LIMIT:-10g}
+
+ # Uncomment to enable the Spark History Server for inspecting completed
+ # benchmark runs at http://localhost:18080. Requires event logs in
+ # $RESULTS_DIR/spark-events (created by `mkdir -p $RESULTS_DIR/spark-events`
+ # before starting the cluster).
+ #
+ # history-server:
+ # image: ${BENCH_IMAGE:-comet-bench}
+ # container_name: spark-history
+ # hostname: spark-history
+ # command: /opt/spark/sbin/start-history-server.sh
+ # ports:
+ # - "18080:18080"
+ # volumes:
+ # - ${RESULTS_DIR:-/tmp/bench-results}:/results:ro
+ # environment:
+ # - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+  #     - SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/results/spark-events
+ # - SPARK_NO_DAEMONIZE=true
+
diff --git a/benchmarks/tpc/run.py b/benchmarks/tpc/run.py
index 223a7d08e..d98d1693a 100755
--- a/benchmarks/tpc/run.py
+++ b/benchmarks/tpc/run.py
@@ -110,6 +110,7 @@ COMMON_SPARK_CONF = {
"spark.memory.offHeap.enabled": "true",
"spark.memory.offHeap.size": "16g",
"spark.eventLog.enabled": "true",
+ "spark.eventLog.dir": "/results/spark-events",
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
}
@@ -120,9 +121,9 @@ COMMON_SPARK_CONF = {
BENCHMARK_PROFILES = {
"tpch": {
- "executor_instances": "1",
+ "executor_instances": "2",
"executor_cores": "8",
- "max_cores": "8",
+ "max_cores": "16",
"data_env": "TPCH_DATA",
"format": "parquet",
},
@@ -280,10 +281,6 @@ def build_spark_submit_cmd(config, benchmark, args):
data_val = os.environ.get(data_var, "")
cmd += ["--data", data_val]
- script_dir = os.path.dirname(os.path.abspath(__file__))
- queries_path = os.path.join(script_dir, "queries", benchmark)
- cmd += ["--queries", queries_path]
-
cmd += ["--output", args.output]
cmd += ["--iterations", str(args.iterations)]
diff --git a/benchmarks/tpc/tpcbench.py b/benchmarks/tpc/tpcbench.py
index 400ccd175..f043afb1c 100644
--- a/benchmarks/tpc/tpcbench.py
+++ b/benchmarks/tpc/tpcbench.py
@@ -26,6 +26,7 @@ Supports two data sources:
import argparse
from datetime import datetime
import json
+import os
from pyspark.sql import SparkSession
import time
from typing import Dict
@@ -50,18 +51,21 @@ def main(
data_path: str,
catalog: str,
database: str,
- query_path: str,
iterations: int,
output: str,
name: str,
format: str,
query_num: int = None,
write_path: str = None,
- options: Dict[str, str] = None
+ options: Dict[str, str] = None,
):
if options is None:
options = {}
+ query_path = os.path.join(
+ os.path.dirname(os.path.abspath(__file__)), "queries", benchmark
+ )
+
spark = SparkSession.builder \
.appName(f"{name} benchmark derived from {benchmark}") \
.getOrCreate()
@@ -94,7 +98,10 @@ def main(
print(f"Registering table {table} from {source}")
df = spark.table(source)
else:
+ # Support both "customer/" and "customer.parquet/" layouts
source = f"{data_path}/{table}.{format}"
+ if not os.path.exists(source):
+ source = f"{data_path}/{table}"
print(f"Registering table {table} from {source}")
df = spark.read.format(format).options(**options).load(source)
df.createOrReplaceTempView(table)
@@ -104,7 +111,6 @@ def main(
results = {
'engine': 'datafusion-comet',
'benchmark': benchmark,
- 'query_path': query_path,
'spark_conf': conf_dict,
}
if using_iceberg:
@@ -215,10 +221,6 @@ if __name__ == "__main__":
help="Database containing TPC tables (only used with --catalog)"
)
- parser.add_argument(
- "--queries", required=True,
- help="Path to query SQL files"
- )
parser.add_argument(
"--iterations", type=int, default=1,
help="Number of iterations"
@@ -246,12 +248,11 @@ if __name__ == "__main__":
args.data,
args.catalog,
args.database,
- args.queries,
args.iterations,
args.output,
args.name,
args.format,
args.query,
args.write,
- args.options
+ args.options,
)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]