This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new a01774ff6 chore: Add Docker Compose support for TPC benchmarks (#3576)
a01774ff6 is described below
commit a01774ff6949b6f19084e02336a42da1d50175e2
Author: Andy Grove <[email protected]>
AuthorDate: Tue Feb 24 09:08:25 2026 -0700
chore: Add Docker Compose support for TPC benchmarks (#3576)
* Add Docker Compose support for TPC benchmarks
Add Docker Compose setup for running TPC-H/TPC-DS benchmarks in an
isolated Spark standalone cluster with two workers. Bundle TPC query
SQL files in the repository, removing the need for external
TPCH_QUERIES/TPCDS_QUERIES environment variables. Add
Dockerfile.build-comet for cross-compiling Comet JARs with Linux
native libraries on macOS.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* Add lightweight Docker Compose file for laptop benchmarking
Add docker-compose-laptop.yml with a single worker (~12 GB total) for
SF1-SF10 testing on laptops, replacing the --scale/env var workaround.
Update README to document both compose files side by side.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* revert a docs change
* Fix undefined variable warning in Dockerfile.build-comet
Remove self-reference to LD_LIBRARY_PATH since it is not defined
earlier in the Dockerfile, causing a Docker build warning.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
* fix
* prettier
* fix
* prettier
---------
Co-authored-by: Claude Opus 4.6 <[email protected]>
---
.gitignore | 2 +
benchmarks/tpc/README.md | 178 ++++++++++++++++++++-
benchmarks/tpc/infra/docker/Dockerfile | 58 +++++++
benchmarks/tpc/infra/docker/Dockerfile.build-comet | 76 +++++++++
.../tpc/infra/docker/docker-compose-laptop.yml | 97 +++++++++++
benchmarks/tpc/infra/docker/docker-compose.yml | 131 +++++++++++++++
benchmarks/tpc/run.py | 9 +-
benchmarks/tpc/tpcbench.py | 19 +--
8 files changed, 550 insertions(+), 20 deletions(-)
diff --git a/.gitignore b/.gitignore
index 05b37627b..15cac247e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,8 +17,10 @@ filtered_rat.txt
dev/dist
apache-rat-*.jar
venv
+.venv
dev/release/comet-rm/workdir
spark/benchmarks
.DS_Store
comet-event-trace.json
__pycache__
+output
diff --git a/benchmarks/tpc/README.md b/benchmarks/tpc/README.md
index 779ad1753..eb1fb0479 100644
--- a/benchmarks/tpc/README.md
+++ b/benchmarks/tpc/README.md
@@ -26,6 +26,10 @@ For full instructions on running these benchmarks on an EC2 instance, see the [C
[Comet Benchmarking on EC2 Guide]: https://datafusion.apache.org/comet/contributor-guide/benchmarking_aws_ec2.html
+## Setup
+
+TPC queries are bundled in `benchmarks/tpc/queries/` (derived from TPC-H/DS under the TPC Fair Use Policy).
+
## Usage
All benchmarks are run via `run.py`:
@@ -55,10 +59,9 @@ export SPARK_HOME=/opt/spark-3.5.3-bin-hadoop3/
export SPARK_MASTER=spark://yourhostname:7077
```
-Set path to queries and data:
+Set path to data (TPC queries are bundled in `benchmarks/tpc/queries/`):
```shell
-export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
export TPCH_DATA=/mnt/bigdata/tpch/sf100/
```
@@ -135,9 +138,9 @@ $SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 \
--conf spark.driver.memory=8G \
- --conf spark.executor.instances=1 \
+ --conf spark.executor.instances=2 \
--conf spark.executor.cores=8 \
- --conf spark.cores.max=8 \
+ --conf spark.cores.max=16 \
--conf spark.executor.memory=16g \
create-iceberg-tables.py \
--benchmark tpch \
@@ -166,7 +169,6 @@ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.10.0.jar
export ICEBERG_JAR=/path/to/iceberg-spark-runtime-3.5_2.12-1.8.1.jar
export ICEBERG_WAREHOUSE=/mnt/bigdata/iceberg-warehouse
-export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
sudo ./drop-caches.sh
python3 run.py --engine comet-iceberg --benchmark tpch
```
@@ -185,6 +187,172 @@ physical plan output.
| `--catalog`  | No       | `local`        | Iceberg catalog name         |
| `--database` | No       | benchmark name | Database name for the tables |
+## Running with Docker
+
+A Docker Compose setup is provided in `infra/docker/` for running benchmarks in an isolated
+Spark standalone cluster. The Docker image supports both **Linux (amd64)** and **macOS (arm64)**
+via architecture-agnostic Java symlinks created at build time.
+
+### Build the image
+
+The image must be built for the correct platform to match the native libraries in the
+engine JARs (e.g. Comet bundles `libcomet.so` for a specific OS/arch).
+
+```shell
+docker build -t comet-bench -f benchmarks/tpc/infra/docker/Dockerfile .
+```
+
+### Building a compatible Comet JAR
+
+The Comet JAR contains platform-specific native libraries (`libcomet.so` / `libcomet.dylib`).
+A JAR built on the host may not work inside the Docker container due to OS, architecture,
+or glibc version mismatches. Use `Dockerfile.build-comet` to build a JAR with compatible
+native libraries:
+
+- **macOS (Apple Silicon):** The host JAR contains `darwin/aarch64` libraries which
+ won't work in Linux containers. You **must** use the build Dockerfile.
+- **Linux:** If your host glibc version differs from the container's, the native library
+  will fail to load with a `GLIBC_x.xx not found` error. The build Dockerfile uses
+ Ubuntu 20.04 (glibc 2.31) for broad compatibility. Use it if you see
+ `UnsatisfiedLinkError` mentioning glibc when running benchmarks.
+
+```shell
+mkdir -p output
+docker build -t comet-builder \
+ -f benchmarks/tpc/infra/docker/Dockerfile.build-comet .
+docker run --rm -v $(pwd)/output:/output comet-builder
+export COMET_JAR=$(pwd)/output/comet-spark-spark3.5_2.12-*.jar
+```
+
+### Platform notes
+
+**macOS (Apple Silicon):** Docker Desktop is required.
+
+- **Memory:** Docker Desktop defaults to a small memory allocation (often 8 GB) which
+  is not enough for Spark benchmarks. Go to **Docker Desktop > Settings > Resources >
+  Memory** and increase it to at least 48 GB (each worker requests 16 GB for its executor
+  plus overhead, and the driver needs 8 GB). Without enough memory, executors will be
+ OOM-killed (exit code 137).
+- **File Sharing:** You may need to add your data directory (e.g. `/opt`) to
+  **Docker Desktop > Settings > Resources > File Sharing** before mounting host volumes.
+
+**Linux (amd64):** Docker uses cgroup memory limits directly without a VM layer. No
+special Docker configuration is needed, but you may still need to build the Comet JAR
+using `Dockerfile.build-comet` (see above) if your host glibc version doesn't match
+the container's.
+
+The Docker image auto-detects the container architecture (amd64/arm64) and sets up
+arch-agnostic Java symlinks. The compose file uses `BENCH_JAVA_HOME` (not `JAVA_HOME`)
+to avoid inheriting the host's Java path into the container.
+
+### Start the cluster
+
+Set environment variables pointing to your host paths, then start the Spark master and
+two workers:
+
+```shell
+export DATA_DIR=/mnt/bigdata/tpch/sf100
+export RESULTS_DIR=/tmp/bench-results
+export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.10.0.jar
+
+mkdir -p $RESULTS_DIR/spark-events
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+```
+
+Set `COMET_JAR`, `GLUTEN_JAR`, or `ICEBERG_JAR` to the host path of the engine JAR you
+want to use. Each JAR is mounted individually into the container, so you can easily switch
+between versions by changing the path and restarting.
+
+### Run benchmarks
+
+Use `docker compose run --rm` to execute benchmarks. The `--rm` flag removes the
+container when it exits, preventing port conflicts on subsequent runs. Pass
+`--no-restart` since the cluster is already managed by Compose, and `--output /results`
+so that output files land in the mounted results directory:
+
+```shell
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml \
+ run --rm -p 4040:4040 bench \
+ python3 /opt/benchmarks/run.py \
+ --engine comet --benchmark tpch --output /results --no-restart
+```
+
+The `-p 4040:4040` flag exposes the Spark Application UI on the host. The following
+UIs are available during a benchmark run:
+
+| UI | URL |
+| ----------------- | ---------------------- |
+| Spark Master | http://localhost:8080 |
+| Worker 1 | http://localhost:8081 |
+| Worker 2 | http://localhost:8082 |
+| Spark Application | http://localhost:4040 |
+| History Server | http://localhost:18080 |
+
+> **Note:** The Master UI links to the Application UI using the container's internal
+> hostname, which is not reachable from the host. Use `http://localhost:4040` directly
+> to access the Application UI.
+
+The Spark Application UI is only available while a benchmark is running. To inspect
+completed runs, uncomment the `history-server` service in `docker-compose.yml` and
+restart the cluster. The History Server reads event logs from `$RESULTS_DIR/spark-events`.
+
+For Gluten (requires Java 8), you must restart the **entire cluster** with `BENCH_JAVA_HOME`
+set so that all services (master, workers, and bench) use Java 8:
+
+```shell
+export BENCH_JAVA_HOME=/usr/lib/jvm/java-8-openjdk
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml down
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml \
+ run --rm bench \
+ python3 /opt/benchmarks/run.py \
+ --engine gluten --benchmark tpch --output /results --no-restart
+```
+
+> **Important:** Only passing `-e JAVA_HOME=...` to the `bench` container is not
+> sufficient -- the workers also need Java 8 or Gluten will fail at runtime with
+> `sun.misc.Unsafe` errors. Unset `BENCH_JAVA_HOME` (or switch it back to Java 17)
+> and restart the cluster before running Comet or Spark benchmarks.
+
+### Memory limits
+
+Two compose files are provided for different hardware profiles:
+
+| File                        | Workers | Total memory | Use case                       |
+| --------------------------- | ------- | ------------ | ------------------------------ |
+| `docker-compose.yml`        | 2       | ~74 GB       | SF100+ on a workstation/server |
+| `docker-compose-laptop.yml` | 1       | ~12 GB       | SF1–SF10 on a laptop           |
+
+**`docker-compose.yml`** (workstation default):
+
+| Container | Container limit (`mem_limit`) | Spark JVM allocation |
+| -------------- | ----------------------------- | ------------------------- |
+| spark-worker-1 | 32 GB | 16 GB executor + overhead |
+| spark-worker-2 | 32 GB | 16 GB executor + overhead |
+| bench (driver) | 10 GB | 8 GB driver |
+| **Total** | **74 GB** | |
+
+Configure via environment variables: `WORKER_MEM_LIMIT` (default: 32g per worker),
+`BENCH_MEM_LIMIT` (default: 10g), `WORKER_MEMORY` (default: 16g, Spark executor memory),
+`WORKER_CORES` (default: 8).
+
+### Running on a laptop with small scale factors
+
+For local development or testing with small scale factors (e.g. SF1 or SF10), use the
+laptop compose file which runs a single worker with reduced memory:
+
+```shell
+docker compose -f benchmarks/tpc/infra/docker/docker-compose-laptop.yml up -d
+```
+
+This starts one worker (4 GB executor inside an 8 GB container) and a 4 GB bench
+container, totaling approximately **12 GB** of memory.
+
+The benchmark scripts request 2 executor instances and 16 max cores by default
+(`run.py`). Spark will simply use whatever resources are available on the single worker,
+so no script changes are needed.
+
### Comparing Parquet vs Iceberg performance
Run both benchmarks and compare:
diff --git a/benchmarks/tpc/infra/docker/Dockerfile b/benchmarks/tpc/infra/docker/Dockerfile
new file mode 100644
index 000000000..60567536a
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/Dockerfile
@@ -0,0 +1,58 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Benchmark image for running TPC-H and TPC-DS benchmarks across engines
+# (Spark, Comet, Gluten).
+#
+# Build (from repository root):
+# docker build -t comet-bench -f benchmarks/tpc/infra/docker/Dockerfile .
+
+ARG SPARK_IMAGE=apache/spark:3.5.2-python3
+FROM ${SPARK_IMAGE}
+
+USER root
+
+# Install Java 8 (Gluten) and Java 17 (Comet) plus Python 3.
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends \
+ openjdk-8-jdk-headless \
+ openjdk-17-jdk-headless \
+ python3 python3-pip procps \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+# Default to Java 17 (override with JAVA_HOME at runtime for Gluten).
+# Detect architecture (amd64 or arm64) so the image works on both Linux and macOS.
+ARG TARGETARCH
+RUN ln -s /usr/lib/jvm/java-17-openjdk-${TARGETARCH} /usr/lib/jvm/java-17-openjdk && \
+ ln -s /usr/lib/jvm/java-8-openjdk-${TARGETARCH} /usr/lib/jvm/java-8-openjdk
+ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+
+# Copy the benchmark scripts into the image.
+COPY benchmarks/tpc/run.py /opt/benchmarks/run.py
+COPY benchmarks/tpc/tpcbench.py /opt/benchmarks/tpcbench.py
+COPY benchmarks/tpc/engines /opt/benchmarks/engines
+COPY benchmarks/tpc/queries /opt/benchmarks/queries
+COPY benchmarks/tpc/create-iceberg-tables.py /opt/benchmarks/create-iceberg-tables.py
+COPY benchmarks/tpc/generate-comparison.py /opt/benchmarks/generate-comparison.py
+
+# Engine JARs are bind-mounted or copied in at runtime via --jars.
+# Data and query paths are also bind-mounted.
+
+WORKDIR /opt/benchmarks
+
+# Defined in the base apache/spark image.
+ARG spark_uid
+USER ${spark_uid}
diff --git a/benchmarks/tpc/infra/docker/Dockerfile.build-comet b/benchmarks/tpc/infra/docker/Dockerfile.build-comet
new file mode 100644
index 000000000..af5a0257a
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/Dockerfile.build-comet
@@ -0,0 +1,76 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Build a Comet JAR with native libraries for the current platform.
+#
+# This is useful on macOS (Apple Silicon) where the host-built JAR contains
+# darwin/aarch64 native libraries but Docker containers need linux/aarch64.
+#
+# Usage (from repository root):
+#   docker build -t comet-builder -f benchmarks/tpc/infra/docker/Dockerfile.build-comet .
+# docker run --rm -v $(pwd)/output:/output comet-builder
+#
+# The JAR is copied to ./output/ on the host.
+
+# Use Ubuntu 20.04 to match the GLIBC version (2.31) in apache/spark images.
+FROM ubuntu:20.04 AS builder
+
+ARG TARGETARCH
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Install build dependencies: Java 17, Maven wrapper prerequisites, GCC 11.
+# Ubuntu 20.04's default GCC 9 has a memcmp bug (GCC #95189) that breaks aws-lc-sys.
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ openjdk-17-jdk-headless \
+ curl ca-certificates git pkg-config \
+ libssl-dev unzip software-properties-common \
+ && add-apt-repository -y ppa:ubuntu-toolchain-r/test \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends gcc-11 g++-11 make \
+ && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 110 \
+ && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 110 \
+ && update-alternatives --install /usr/bin/cc cc /usr/bin/gcc-11 110 \
+ && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+# Install protoc 25.x (Ubuntu 20.04's protoc is too old for proto3 optional fields).
+ARG PROTOC_VERSION=25.6
+RUN ARCH=$(uname -m) && \
+ if [ "$ARCH" = "aarch64" ]; then PROTOC_ARCH="linux-aarch_64"; \
+ else PROTOC_ARCH="linux-x86_64"; fi && \
+  curl -sLO "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" && \
+  unzip -o "protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" -d /usr/local bin/protoc && \
+ rm "protoc-${PROTOC_VERSION}-${PROTOC_ARCH}.zip" && \
+ protoc --version
+
+# Set JAVA_HOME and LD_LIBRARY_PATH so the Rust build can find libjvm.
+RUN ln -s /usr/lib/jvm/java-17-openjdk-${TARGETARCH} /usr/lib/jvm/java-17-openjdk
+ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+ENV LD_LIBRARY_PATH=${JAVA_HOME}/lib/server
+
+# Install Rust.
+RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+ENV PATH="/root/.cargo/bin:${PATH}"
+
+WORKDIR /build
+
+# Copy the full source tree.
+COPY . .
+
+# Build native code + package the JAR (skip tests).
+RUN make release-nogit
+
+# The entrypoint copies the built JAR to /output (bind-mounted from host).
+RUN mkdir -p /output
+CMD ["sh", "-c", "cp spark/target/comet-spark-spark3.5_2.12-*-SNAPSHOT.jar /output/ && echo 'Comet JAR copied to /output/' && ls -lh /output/*.jar"]
diff --git a/benchmarks/tpc/infra/docker/docker-compose-laptop.yml b/benchmarks/tpc/infra/docker/docker-compose-laptop.yml
new file mode 100644
index 000000000..6c5d8dbaf
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/docker-compose-laptop.yml
@@ -0,0 +1,97 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Lightweight Spark standalone cluster for TPC benchmarks on a laptop.
+#
+# Single worker, ~12 GB total memory. Suitable for SF1-SF10 testing.
+#
+# Usage:
+# export COMET_JAR=/path/to/comet-spark-0.10.0.jar
+#   docker compose -f benchmarks/tpc/infra/docker/docker-compose-laptop.yml up -d
+#
+# Environment variables (set in .env or export before running):
+# BENCH_IMAGE - Docker image to use (default: comet-bench)
+# DATA_DIR - Host path to TPC data (default: /tmp/tpc-data)
+#   RESULTS_DIR     - Host path for results output (default: /tmp/bench-results)
+# COMET_JAR - Host path to Comet JAR
+# GLUTEN_JAR - Host path to Gluten JAR
+# ICEBERG_JAR - Host path to Iceberg Spark runtime JAR
+#   BENCH_JAVA_HOME - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
+# Set to /usr/lib/jvm/java-8-openjdk for Gluten
+
+x-volumes: &volumes
+ - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}:/results
+ - ${COMET_JAR:-/dev/null}:/jars/comet.jar:ro
+ - ${GLUTEN_JAR:-/dev/null}:/jars/gluten.jar:ro
+ - ${ICEBERG_JAR:-/dev/null}:/jars/iceberg.jar:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}/logs:/opt/spark/logs
+ - ${RESULTS_DIR:-/tmp/bench-results}/work:/opt/spark/work
+
+services:
+ spark-master:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-master
+ hostname: spark-master
+ command: /opt/spark/sbin/start-master.sh --host spark-master
+ ports:
+ - "7077:7077"
+ - "8080:8080"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_MASTER_HOST=spark-master
+ - SPARK_NO_DAEMONIZE=true
+
+ spark-worker-1:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-worker-1
+ hostname: spark-worker-1
+ depends_on:
+ - spark-master
+ command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
+ ports:
+ - "8081:8081"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_WORKER_CORES=4
+ - SPARK_WORKER_MEMORY=4g
+ - SPARK_NO_DAEMONIZE=true
+ mem_limit: 8g
+ memswap_limit: 8g
+
+ bench:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: bench-runner
+ depends_on:
+ - spark-master
+ - spark-worker-1
+ # Override 'command' to run a specific benchmark, e.g.:
+ # docker compose run bench python3 /opt/benchmarks/run.py \
+ # --engine comet --benchmark tpch --no-restart
+    command: ["echo", "Use 'docker compose run bench python3 /opt/benchmarks/run.py ...' to run benchmarks"]
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_HOME=/opt/spark
+ - SPARK_MASTER=spark://spark-master:7077
+ - COMET_JAR=/jars/comet.jar
+ - GLUTEN_JAR=/jars/gluten.jar
+ - ICEBERG_JAR=/jars/iceberg.jar
+ - TPCH_DATA=/data
+ - TPCDS_DATA=/data
+ mem_limit: 4g
+ memswap_limit: 4g
diff --git a/benchmarks/tpc/infra/docker/docker-compose.yml b/benchmarks/tpc/infra/docker/docker-compose.yml
new file mode 100644
index 000000000..cca8cffa1
--- /dev/null
+++ b/benchmarks/tpc/infra/docker/docker-compose.yml
@@ -0,0 +1,131 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Spark standalone cluster for TPC benchmarks.
+#
+# Two workers are used so that shuffles go through the network stack,
+# which better reflects real cluster behavior.
+#
+# Usage:
+# export COMET_JAR=/path/to/comet-spark-0.10.0.jar
+# docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml up -d
+#
+# Environment variables (set in .env or export before running):
+# BENCH_IMAGE - Docker image to use (default: comet-bench)
+# DATA_DIR - Host path to TPC data (default: /tmp/tpc-data)
+#   RESULTS_DIR     - Host path for results output (default: /tmp/bench-results)
+# COMET_JAR - Host path to Comet JAR
+# GLUTEN_JAR - Host path to Gluten JAR
+# ICEBERG_JAR - Host path to Iceberg Spark runtime JAR
+# WORKER_MEM_LIMIT - Hard memory limit per worker container (default: 32g)
+# BENCH_MEM_LIMIT - Hard memory limit for the bench runner (default: 10g)
+#   BENCH_JAVA_HOME - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
+# Set to /usr/lib/jvm/java-8-openjdk for Gluten
+
+x-volumes: &volumes
+ - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}:/results
+ - ${COMET_JAR:-/dev/null}:/jars/comet.jar:ro
+ - ${GLUTEN_JAR:-/dev/null}:/jars/gluten.jar:ro
+ - ${ICEBERG_JAR:-/dev/null}:/jars/iceberg.jar:ro
+ - ${RESULTS_DIR:-/tmp/bench-results}/logs:/opt/spark/logs
+ - ${RESULTS_DIR:-/tmp/bench-results}/work:/opt/spark/work
+
+x-worker: &worker
+ image: ${BENCH_IMAGE:-comet-bench}
+ depends_on:
+ - spark-master
+ command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_WORKER_CORES=${WORKER_CORES:-8}
+ - SPARK_WORKER_MEMORY=${WORKER_MEMORY:-16g}
+ - SPARK_NO_DAEMONIZE=true
+ mem_limit: ${WORKER_MEM_LIMIT:-32g}
+ memswap_limit: ${WORKER_MEM_LIMIT:-32g}
+
+services:
+ spark-master:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: spark-master
+ hostname: spark-master
+ command: /opt/spark/sbin/start-master.sh --host spark-master
+ ports:
+ - "7077:7077"
+ - "8080:8080"
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_MASTER_HOST=spark-master
+ - SPARK_NO_DAEMONIZE=true
+
+ spark-worker-1:
+ <<: *worker
+ container_name: spark-worker-1
+ hostname: spark-worker-1
+ ports:
+ - "8081:8081"
+
+ spark-worker-2:
+ <<: *worker
+ container_name: spark-worker-2
+ hostname: spark-worker-2
+ ports:
+ - "8082:8081"
+
+ bench:
+ image: ${BENCH_IMAGE:-comet-bench}
+ container_name: bench-runner
+ depends_on:
+ - spark-master
+ - spark-worker-1
+ - spark-worker-2
+ # Override 'command' to run a specific benchmark, e.g.:
+ # docker compose run bench python3 /opt/benchmarks/run.py \
+ # --engine comet --benchmark tpch --no-restart
+    command: ["echo", "Use 'docker compose run bench python3 /opt/benchmarks/run.py ...' to run benchmarks"]
+ volumes: *volumes
+ environment:
+ - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+ - SPARK_HOME=/opt/spark
+ - SPARK_MASTER=spark://spark-master:7077
+ - COMET_JAR=/jars/comet.jar
+ - GLUTEN_JAR=/jars/gluten.jar
+ - ICEBERG_JAR=/jars/iceberg.jar
+ - TPCH_DATA=/data
+ - TPCDS_DATA=/data
+ mem_limit: ${BENCH_MEM_LIMIT:-10g}
+ memswap_limit: ${BENCH_MEM_LIMIT:-10g}
+
+ # Uncomment to enable the Spark History Server for inspecting completed
+ # benchmark runs at http://localhost:18080. Requires event logs in
+ # $RESULTS_DIR/spark-events (created by `mkdir -p $RESULTS_DIR/spark-events`
+ # before starting the cluster).
+ #
+ # history-server:
+ # image: ${BENCH_IMAGE:-comet-bench}
+ # container_name: spark-history
+ # hostname: spark-history
+ # command: /opt/spark/sbin/start-history-server.sh
+ # ports:
+ # - "18080:18080"
+ # volumes:
+ # - ${RESULTS_DIR:-/tmp/bench-results}:/results:ro
+ # environment:
+ # - JAVA_HOME=${BENCH_JAVA_HOME:-/usr/lib/jvm/java-17-openjdk}
+  #     - SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/results/spark-events
+ # - SPARK_NO_DAEMONIZE=true
+
diff --git a/benchmarks/tpc/run.py b/benchmarks/tpc/run.py
index 223a7d08e..d98d1693a 100755
--- a/benchmarks/tpc/run.py
+++ b/benchmarks/tpc/run.py
@@ -110,6 +110,7 @@ COMMON_SPARK_CONF = {
"spark.memory.offHeap.enabled": "true",
"spark.memory.offHeap.size": "16g",
"spark.eventLog.enabled": "true",
+ "spark.eventLog.dir": "/results/spark-events",
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
}
@@ -120,9 +121,9 @@ COMMON_SPARK_CONF = {
BENCHMARK_PROFILES = {
"tpch": {
- "executor_instances": "1",
+ "executor_instances": "2",
"executor_cores": "8",
- "max_cores": "8",
+ "max_cores": "16",
"data_env": "TPCH_DATA",
"format": "parquet",
},
@@ -280,10 +281,6 @@ def build_spark_submit_cmd(config, benchmark, args):
data_val = os.environ.get(data_var, "")
cmd += ["--data", data_val]
- script_dir = os.path.dirname(os.path.abspath(__file__))
- queries_path = os.path.join(script_dir, "queries", benchmark)
- cmd += ["--queries", queries_path]
-
cmd += ["--output", args.output]
cmd += ["--iterations", str(args.iterations)]
diff --git a/benchmarks/tpc/tpcbench.py b/benchmarks/tpc/tpcbench.py
index 400ccd175..f043afb1c 100644
--- a/benchmarks/tpc/tpcbench.py
+++ b/benchmarks/tpc/tpcbench.py
@@ -26,6 +26,7 @@ Supports two data sources:
import argparse
from datetime import datetime
import json
+import os
from pyspark.sql import SparkSession
import time
from typing import Dict
@@ -50,18 +51,21 @@ def main(
data_path: str,
catalog: str,
database: str,
- query_path: str,
iterations: int,
output: str,
name: str,
format: str,
query_num: int = None,
write_path: str = None,
- options: Dict[str, str] = None
+ options: Dict[str, str] = None,
):
if options is None:
options = {}
+ query_path = os.path.join(
+ os.path.dirname(os.path.abspath(__file__)), "queries", benchmark
+ )
+
spark = SparkSession.builder \
.appName(f"{name} benchmark derived from {benchmark}") \
.getOrCreate()
@@ -94,7 +98,10 @@ def main(
print(f"Registering table {table} from {source}")
df = spark.table(source)
else:
+ # Support both "customer/" and "customer.parquet/" layouts
source = f"{data_path}/{table}.{format}"
+ if not os.path.exists(source):
+ source = f"{data_path}/{table}"
print(f"Registering table {table} from {source}")
df = spark.read.format(format).options(**options).load(source)
df.createOrReplaceTempView(table)
@@ -104,7 +111,6 @@ def main(
results = {
'engine': 'datafusion-comet',
'benchmark': benchmark,
- 'query_path': query_path,
'spark_conf': conf_dict,
}
if using_iceberg:
@@ -215,10 +221,6 @@ if __name__ == "__main__":
help="Database containing TPC tables (only used with --catalog)"
)
- parser.add_argument(
- "--queries", required=True,
- help="Path to query SQL files"
- )
parser.add_argument(
"--iterations", type=int, default=1,
help="Number of iterations"
@@ -246,12 +248,11 @@ if __name__ == "__main__":
args.data,
args.catalog,
args.database,
- args.queries,
args.iterations,
args.output,
args.name,
args.format,
args.query,
args.write,
- args.options
+ args.options,
)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]