martin-g commented on code in PR #1881:
URL: https://github.com/apache/datafusion-ballista/pull/1881#discussion_r3440916745

##########
docs/source/user-guide/deployment/quick-start.md:
##########
@@ -19,128 +19,167 @@
 # Ballista Quickstart
-A simple way to start a local cluster for testing purposes is to use cargo to build the project and then run the scheduler and executor binaries directly.
+There are two ways to get a local Ballista cluster running. Choose based on your goal:
-Project Requirements:
+| | [Evaluate Ballista](#path-a-evaluate-with-docker-2-min) | [Build from source](#path-b-build-from-source-20-min) |
+|---|---|---|
+| Goal | Try Ballista against the last stable release | Develop or test against local code changes |
+| Prerequisites | Docker | Rust, protoc |
+| Cold start time | ~2 min (image pull) | ~20 min (full compile) |

Review Comment:
   Again, the times here depend on your network speed and your hardware ...

##########
docs/source/user-guide/deployment/docker-compose.md:
##########
@@ -19,37 +19,51 @@
 # Starting a Ballista Cluster using Docker Compose
-Docker Compose is a convenient way to launch a cluster when testing locally.
+Two Compose files are provided. Choose based on whether you need the last stable release
+or want to run against local source changes.
-## Build Docker Images
+## Option 1: Pre-built images (no local build required)
-To create the required Docker images please refer to the [docker deployment page](docker.md).
+`docker-compose.quick.yml` pulls images directly from GHCR — no Rust toolchain needed.
+Images are published on each stable release; `latest` tracks the most recent release,
+not the `main` branch.
-## Start a Cluster
+```bash
+docker compose -f docker-compose.quick.yml up
+```
+
+See the [quickstart guide](quick-start.md) for connection instructions and data volume setup.
-Using the [docker-compose.yml](https://github.com/apache/datafusion-ballista/blob/main/docker-compose.yml) from the
-source repository, run the following command to start a cluster:
+## Option 2: Build from source
+
+`docker-compose.yml` builds executor and scheduler images from the local Dockerfiles.
+The Dockerfiles copy pre-compiled binaries — they do **not** run `cargo build` themselves.
+You must compile first:
 ```bash
-docker-compose up --build
+# Step 1 — compile (requires Rust + protoc, takes ~20 min cold)

Review Comment:
   `~20 min` depends on your hardware.
   For me it takes 3 mins to `cargo clean && cargo build --release` the whole project.

##########
docs/source/user-guide/deployment/docker-compose.md:
##########
@@ -19,37 +19,51 @@
 # Starting a Ballista Cluster using Docker Compose
-Docker Compose is a convenient way to launch a cluster when testing locally.
+Two Compose files are provided. Choose based on whether you need the last stable release
+or want to run against local source changes.
-## Build Docker Images
+## Option 1: Pre-built images (no local build required)
-To create the required Docker images please refer to the [docker deployment page](docker.md).
+`docker-compose.quick.yml` pulls images directly from GHCR — no Rust toolchain needed.
+Images are published on each stable release; `latest` tracks the most recent release,
+not the `main` branch.
-## Start a Cluster
+```bash
+docker compose -f docker-compose.quick.yml up
+```
+
+See the [quickstart guide](quick-start.md) for connection instructions and data volume setup.
-Using the [docker-compose.yml](https://github.com/apache/datafusion-ballista/blob/main/docker-compose.yml) from the
-source repository, run the following command to start a cluster:
+## Option 2: Build from source
+
+`docker-compose.yml` builds executor and scheduler images from the local Dockerfiles.
+The Dockerfiles copy pre-compiled binaries — they do **not** run `cargo build` themselves.
+You must compile first:
 ```bash
-docker-compose up --build
+# Step 1 — compile (requires Rust + protoc, takes ~20 min cold)
+cargo build --release
+
+# Step 2 — build Docker images and start the cluster
+docker compose up --build
 ```
-This should show output similar to the following:
+Skipping Step 1 will cause the build to fail because the `COPY target/release/ballista-*`
+instruction in the Dockerfiles will find no binaries to copy.
+
+Expected output after a successful start:
-```bash
-$ docker-compose up
-Creating network "ballista-benchmarks_default" with the default driver
-Creating ballista-benchmarks_ballista-scheduler_1 ... done
-Creating ballista-benchmarks_ballista-executor_1 ... done
-Attaching to ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
-ballista-scheduler_1 | INFO ballista_scheduler: Ballista v52.0.0 Scheduler listening on 0.0.0.0:50050
-ballista-executor_1 | INFO ballista_executor: Ballista v52.0.0 Rust Executor listening on 0.0.0.0:50051
+```
+ballista-scheduler_1 | Ballista Scheduler listening on 0.0.0.0:50050

Review Comment:
   Which version of Docker Compose do you use ?
   v2 prints the names with `...-N` suffix, not `..._N`.

##########
docker-compose.quick.yml:
##########
@@ -0,0 +1,82 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Quick-start cluster using pre-built images from GHCR.
+# No local build required — Docker is the only prerequisite.
+#
+# Runs the last stable release. To test against unreleased changes,
+# use docker-compose.yml (requires `cargo build --release` first).
+#
+# Usage:
+#   docker compose -f docker-compose.quick.yml up
+#
+# Connect from Rust:
+#   SessionContext::remote("df://localhost:50050").await?
+#
+# Connect from the CLI:
+#   cargo run -p ballista-cli -- --host localhost --port 50050
+#
+# To make local data available inside executors, uncomment and
+# set the volume path under ballista-executor:
+#   volumes:
+#     - /absolute/path/to/your/data:/data:ro
+
+services:
+  ballista-scheduler:
+    image: ghcr.io/apache/datafusion-ballista-scheduler:latest
+    # --advertise-flight-sql-endpoint enables the scheduler to proxy all
+    # result fetching so clients only ever connect to port 50050.
+    # Without this flag, clients would need direct access to each
+    # executor's Arrow Flight port, which breaks in Docker networking.
+    command: >
+      --bind-host 0.0.0.0
+      --external-host ballista-scheduler
+      --advertise-flight-sql-endpoint
+    ports:
+      - "50050:50050"
+    environment:
+      - RUST_LOG=ballista=info,ballista_scheduler=info
+    healthcheck:
+      test: ["CMD", "bash", "-c", "</dev/tcp/127.0.0.1/50050"]
+      interval: 5s
+      timeout: 5s
+      retries: 10
+    restart: "no"
+
+  ballista-executor:
+    image: ghcr.io/apache/datafusion-ballista-executor:latest
+    command: >
+      --bind-host 0.0.0.0
+      --scheduler-host ballista-scheduler
+      --concurrent-tasks 4
+      --work-dir /work
+    environment:
+      - RUST_LOG=ballista=info,ballista_executor=info
+    # Uncomment to mount local data for queries:
+    # volumes:
+    #   - /absolute/path/to/your/data:/data:ro
+    depends_on:
+      ballista-scheduler:
+        condition: service_healthy
+    healthcheck:
+      test: ["CMD", "bash", "-c", "</dev/tcp/127.0.0.1/50051"]
+      interval: 5s
+      timeout: 5s
+      retries: 10
+    restart: "no"
+    deploy:
+      replicas: 2

Review Comment:
   This is taken into account only when Docker Swarm is used, no ?

##########
docs/source/user-guide/deployment/quick-start.md:
##########
@@ -19,128 +19,167 @@
 # Ballista Quickstart
-A simple way to start a local cluster for testing purposes is to use cargo to build the project and then run the scheduler and executor binaries directly.
+There are two ways to get a local Ballista cluster running. Choose based on your goal:
-Project Requirements:
+| | [Evaluate Ballista](#path-a-evaluate-with-docker-2-min) | [Build from source](#path-b-build-from-source-20-min) |
+|---|---|---|
+| Goal | Try Ballista against the last stable release | Develop or test against local code changes |
+| Prerequisites | Docker | Rust, protoc |
+| Cold start time | ~2 min (image pull) | ~20 min (full compile) |
+| Terminals needed | 1 | 3 |
+
+> [!IMPORTANT]
+> Ballista and DataFusion are developed independently. A given Ballista release may not be compatible
+> with the latest DataFusion version. Check the [compatibility matrix](../configs.md) before integrating.
+
+---
+
+## Path A: Evaluate with Docker (~2 min)
+
+The only prerequisite is [Docker](https://docs.docker.com/get-docker/) with Compose v2.
+
+This uses pre-built images from GHCR that are published on each stable release. The `latest` tag
+tracks the most recent release, not the `main` branch.
+
+```shell
+docker compose -f docker-compose.quick.yml up
+```
+
+You should see output similar to:
+
+```
+ballista-scheduler-1 | Ballista Scheduler v53.0.0 listening on 0.0.0.0:50050
+ballista-executor-1 | Executor registration succeed
+ballista-executor-2 | Executor registration succeed
+```
+
+Two executors start by default. The scheduler listens on `localhost:50050`.
+
+**Connect from Rust:**
+
+```rust
+let ctx = SessionContext::remote("df://localhost:50050").await?;
+```
+
+**Connect from the CLI** (requires a local build — no pre-built CLI image is published):
+
+```shell
+cargo run -p ballista-cli -- --host localhost --port 50050
+```
+
+**To make local data available inside the executors**, uncomment and set the `volumes` block
+in `docker-compose.quick.yml`:
+
+```yaml
+ballista-executor:
+  volumes:
+    - /absolute/path/to/your/data:/data:ro
+```
+
+Then reference `/data/yourfile.parquet` in your queries. The path must be the same inside
+every executor container.
+
+**Tear down:**
+
+```shell
+docker compose -f docker-compose.quick.yml down
+```
+
+---
+
+## Path B: Build from source (~20 min)
+
+Use this path if you need to test local code changes or run against the `main` branch.
+
+**Prerequisites:**
 - [Rust](https://www.rust-lang.org/tools/install)
 - [Protobuf Compiler](https://protobuf.dev/downloads/)
-## Build the project
-
-From the root of the project, build release binaries.
+**Step 1:** Build release binaries from the repository root:
 ```shell
 cargo build --release
 ```
-Start a Ballista scheduler process in a new terminal session.
+**Step 2:** Start the scheduler in a new terminal:
 ```shell
 RUST_LOG=info ./target/release/ballista-scheduler
 ```
-Start one or more Ballista executor processes in new terminal sessions. When starting more than one
-executor, a unique port number must be specified for each executor.
+**Step 3:** Start one or more executors, each in a new terminal. When running multiple
+executors, each needs a unique pair of ports:
 ```shell
 RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50051 --bind-grpc-port 50052
+```
+```shell
 RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50053 --bind-grpc-port 50054
 ```
+> **Two-port model:** each executor exposes an Arrow Flight port (data, `-p`) and a gRPC
+> control port (`--bind-grpc-port`). Both must be reachable by the scheduler.
+
+---
+
 ## Running the examples
-The examples can be run using the `cargo run --bin` syntax. Open a new terminal session and run the following commands.
+Examples live in the `examples/` directory and connect to `localhost:50050` by default.
-### Distributed SQL Example
+### Distributed SQL example
 ```bash
 cd examples
 cargo run --release --example remote-sql
 ```
-#### Source code for distributed SQL example
+### Distributed DataFrame example
-```rust
-use ballista::prelude::*;
-use ballista_examples::test_util;
-use datafusion::{
-    execution::SessionStateBuilder,
-    prelude::{CsvReadOptions, SessionConfig, SessionContext},
-};
-
-/// This example demonstrates executing a simple query against an Arrow data source (CSV) and
-/// fetching results, using SQL
-#[tokio::main]
-async fn main() -> Result<()> {
-    let config = SessionConfig::new_with_ballista()
-        .with_target_partitions(4)
-        .with_ballista_job_name("Remote SQL Example");
-
-    let state = SessionStateBuilder::new()
-        .with_config(config)
-        .with_default_features()
-        .build();
-
-    let ctx = SessionContext::remote_with_state("df://localhost:50050", state).await?;
-
-    let test_data = test_util::examples_test_data();
-
-    ctx.register_csv(
-        "test",
-        &format!("{test_data}/aggregate_test_100.csv"),
-        CsvReadOptions::new(),
-    )
-    .await?;
-
-    let df = ctx
-        .sql(
-            "SELECT c1, MIN(c12), MAX(c12) \
-            FROM test \
-            WHERE c11 > 0.1 AND c11 < 0.9 \
-            GROUP BY c1",
-        )
-        .await?;
-
-    df.show().await?;
-
-    Ok(())
-}
+```bash
+cd examples
+cargo run --release --example remote-dataframe
 ```
-### Distributed DataFrame Example
+### Standalone (single-process) example
+
+No cluster needed — scheduler and executor run in the same process:
 ```bash
 cd examples
-cargo run --release --example remote-dataframe
+cargo run --release --example standalone-sql
 ```
-#### Source code for distributed DataFrame example

Review Comment:
   I found this source code useful. Maybe replace it with a link to the standalone-sql example ?!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
