This is an automated email from the ASF dual-hosted git repository.
milenkovicm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-ballista.git
The following commit(s) were added to refs/heads/main by this push:
new 5ec9ae6b docs: fix outdated content in documentation (#1385)
5ec9ae6b is described below
commit 5ec9ae6b0b5855e7744aa93e6a9acfee25f5ceab
Author: Andy Grove <[email protected]>
AuthorDate: Sat Jan 17 12:18:56 2026 -0700
docs: fix outdated content in documentation (#1385)
* docs: fix outdated content in documentation
- Remove outdated etcd references (etcd backend was removed)
- Update version numbers from old versions to v51.0.0
- Fix executor-slots-policy to task-distribution with correct values
- Remove Sled database references from docker.md
- Update kubernetes.md docker tags and log output format
- Fix Python API: Ballista() -> BallistaBuilder()
- Fix scheduler-policy parameter name and default value
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* chore: add CLAUDE.md to .gitignore
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* style: format markdown with prettier
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add benchmarking section to contributors guide
Link to benchmarks/README.md for TPC-H and performance testing instructions.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>
---
.gitignore | 3 +++
docs/developer/architecture.md | 2 +-
docs/source/contributors-guide/architecture.md | 3 ---
docs/source/contributors-guide/development.md | 11 +++++++++++
docs/source/user-guide/cli.md | 2 +-
docs/source/user-guide/configs.md | 17 ++++++++---------
.../source/user-guide/deployment/docker-compose.md | 10 +++-------
docs/source/user-guide/deployment/docker.md | 21 +++++++++------------
docs/source/user-guide/deployment/kubernetes.md | 22 +++++++++++-----------
docs/source/user-guide/python.md | 11 ++++-------
docs/source/user-guide/tuning-guide.md | 4 ++--
11 files changed, 53 insertions(+), 53 deletions(-)
diff --git a/.gitignore b/.gitignore
index db1b4fe6..9552147c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -102,3 +102,6 @@ dev/dist
# logs
logs/
+
+# Claude Code guidance file (local only)
+CLAUDE.md
diff --git a/docs/developer/architecture.md b/docs/developer/architecture.md
index d49e5013..3e00346b 100644
--- a/docs/developer/architecture.md
+++ b/docs/developer/architecture.md
@@ -60,7 +60,7 @@ The scheduler process implements a gRPC interface (defined in
| GetJobStatus | Get the status of a submitted query |
| RegisterExecutor | Executors call this method to register themselves with the scheduler |
-The scheduler can run in standalone mode, or can be run in clustered mode using etcd as backing store for state.
+The scheduler currently uses in-memory state storage.
## Executor Process
diff --git a/docs/source/contributors-guide/architecture.md b/docs/source/contributors-guide/architecture.md
index b0e00730..b25321d0 100644
--- a/docs/source/contributors-guide/architecture.md
+++ b/docs/source/contributors-guide/architecture.md
@@ -80,9 +80,6 @@ plan or a SQL string. The scheduler then creates an execution graph, which conta
stages (pipelines) that can be scheduled independently. This process is explained in detail in the Distributed
Query Scheduling section of this guide.
-It is possible to have multiple schedulers running with shared state in etcd, so that jobs can continue to run
-even if a scheduler process fails.
-
### Executor
The executor processes connect to a scheduler and poll for tasks to perform. These tasks are physical plans in
diff --git a/docs/source/contributors-guide/development.md b/docs/source/contributors-guide/development.md
index a21595b1..feefc8a6 100644
--- a/docs/source/contributors-guide/development.md
+++ b/docs/source/contributors-guide/development.md
@@ -65,3 +65,14 @@ cargo test
cd examples
cargo run --example standalone_sql --features=ballista/standalone
```
+
+## Benchmarking
+
+For performance testing and benchmarking with TPC-H and other datasets, see the [benchmarks README](../../../benchmarks/README.md).
+
+This includes instructions for:
+
+- Generating TPC-H test data
+- Running benchmarks against DataFusion and Ballista
+- Comparing performance with Apache Spark
+- Running load tests
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
index 213f6034..597bc195 100644
--- a/docs/source/user-guide/cli.md
+++ b/docs/source/user-guide/cli.md
@@ -71,7 +71,7 @@ It is also possible to run the CLI in standalone mode, where it will create a sc
```bash
$ ballista-cli
-Ballista CLI v8.0.0
+Ballista CLI v51.0.0
> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index 5e909e15..56b847e1 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -96,14 +96,13 @@ manage the whole cluster are also needed to be taken care of.
_Example: Specifying configuration options when starting the scheduler_
```shell
-./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --executor-slots-policy
-round-robin-local
+./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --task-distribution round-robin
```
-| key                                          | type   | default     | description                                                                                                                                                                     |
-| -------------------------------------------- | ------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| scheduler-policy                             | Utf8   | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged.                                                                                   |
-| event-loop-buffer-size                       | UInt32 | 10000       | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended.                                                                   |
-| executor-slots-policy                        | Utf8   | bias        | Sets the executor slots policy for the scheduler, possible values: bias, round-robin, round-robin-local. For a cluster with single scheduler, round-robin-local is recommended. |
-| finished-job-data-clean-up-interval-seconds  | UInt64 | 300         | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled.                                                      |
-| finished-job-state-clean-up-interval-seconds | UInt64 | 3600        | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled.                                                        |
+| key                                          | type   | default     | description                                                                                                                |
+| -------------------------------------------- | ------ | ----------- | -------------------------------------------------------------------------------------------------------------------------- |
+| scheduler-policy                             | Utf8   | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged.                              |
+| event-loop-buffer-size                       | UInt32 | 10000       | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended.              |
+| task-distribution                            | Utf8   | bias        | Sets the task distribution policy for the scheduler, possible values: bias, round-robin, consistent-hash.                  |
+| finished-job-data-clean-up-interval-seconds  | UInt64 | 300         | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled. |
+| finished-job-state-clean-up-interval-seconds | UInt64 | 3600        | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled.   |
diff --git a/docs/source/user-guide/deployment/docker-compose.md b/docs/source/user-guide/deployment/docker-compose.md
index 67f40b7a..96f69d69 100644
--- a/docs/source/user-guide/deployment/docker-compose.md
+++ b/docs/source/user-guide/deployment/docker-compose.md
@@ -39,15 +39,11 @@ This should show output similar to the following:
```bash
$ docker-compose up
Creating network "ballista-benchmarks_default" with the default driver
-Creating ballista-benchmarks_etcd_1 ... done
Creating ballista-benchmarks_ballista-scheduler_1 ... done
Creating ballista-benchmarks_ballista-executor_1 ... done
-Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Running with config:
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] work_dir: /tmp/.tmpLVx39c
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] concurrent_tasks: 4
-ballista-scheduler_1 | [2021-08-28T15:55:22Z INFO ballista_scheduler] Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Ballista v0.12.0 Rust Executor listening on 0.0.0.0:50051
+Attaching to ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
+ballista-scheduler_1 | INFO ballista_scheduler: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+ballista-executor_1 | INFO ballista_executor: Ballista v51.0.0 Rust Executor listening on 0.0.0.0:50051
```
The scheduler listens on port 50050 and this is the port that clients will need to connect to.
diff --git a/docs/source/user-guide/deployment/docker.md b/docs/source/user-guide/deployment/docker.md
index a0542377..cf5488a6 100644
--- a/docs/source/user-guide/deployment/docker.md
+++ b/docs/source/user-guide/deployment/docker.md
@@ -67,13 +67,10 @@ Run `docker logs CONTAINER_ID` to check the output from the process:
```
$ docker logs a756055576f3
-2024-02-03T14:49:47.904571Z INFO main ThreadId(01) ballista_scheduler::cluster: Initializing Sled database in temp directory
-
-2024-02-03T14:49:47.924679Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-2024-02-03T14:49:47.924709Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
-2024-02-03T14:49:47.925261Z INFO main ThreadId(01) ballista_scheduler::cluster::kv: Initializing heartbeat listener
-2024-02-03T14:49:47.925476Z INFO main ThreadId(01) ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
-2024-02-03T14:49:47.925587Z INFO tokio-runtime-worker ThreadId(47) ballista_core::event_loop: Starting the event loop query_stage
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
+INFO ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
+INFO ballista_core::event_loop: Starting the event loop query_stage
```
### Start Executors
@@ -99,11 +96,11 @@ Use `docker logs CONTAINER_ID` to check the output from the executor(s):
```
$ docker logs fb8b530cee6d
-2024-02-03T14:50:24.061607Z INFO main ThreadId(01) ballista_executor::executor_process: Running with config:
-2024-02-03T14:50:24.061649Z INFO main ThreadId(01) ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
-2024-02-03T14:50:24.061655Z INFO main ThreadId(01) ballista_executor::executor_process: concurrent_tasks: 48
-2024-02-03T14:50:24.063256Z INFO tokio-runtime-worker ThreadId(44) ballista_executor::executor_process: Ballista v0.12.0 Rust Executor Flight Server listening on 0.0.0.0:50051
-2024-02-03T14:50:24.063281Z INFO tokio-runtime-worker ThreadId(47) ballista_executor::execution_loop: Starting poll work loop with scheduler
+INFO ballista_executor::executor_process: Running with config:
+INFO ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
+INFO ballista_executor::executor_process: concurrent_tasks: 48
+INFO ballista_executor::executor_process: Ballista v51.0.0 Rust Executor Flight Server listening on 0.0.0.0:50051
+INFO ballista_executor::execution_loop: Starting poll work loop with scheduler
```
## Connect from the CLI
diff --git a/docs/source/user-guide/deployment/kubernetes.md b/docs/source/user-guide/deployment/kubernetes.md
index d3062ed8..3e25d1f9 100644
--- a/docs/source/user-guide/deployment/kubernetes.md
+++ b/docs/source/user-guide/deployment/kubernetes.md
@@ -48,10 +48,10 @@ To create the required Docker images please refer to the [docker deployment page
Once the images have been built, you can retag them and can push them to your favourite Docker registry.
```bash
-docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
-docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker push <your-repo>/datafusion-ballista-executor:0.12.0
+docker tag apache/datafusion-ballista-scheduler:latest <your-repo>/datafusion-ballista-scheduler:latest
+docker tag apache/datafusion-ballista-executor:latest <your-repo>/datafusion-ballista-executor:latest
+docker push <your-repo>/datafusion-ballista-scheduler:latest
+docker push <your-repo>/datafusion-ballista-executor:latest
```
## Create Persistent Volume and Persistent Volume Claim
@@ -139,7 +139,7 @@ spec:
spec:
containers:
- name: ballista-scheduler
- image: <your-repo>/datafusion-ballista-scheduler:0.12.0
+ image: <your-repo>/datafusion-ballista-scheduler:latest
args: ["--bind-port=50050"]
ports:
- containerPort: 50050
@@ -169,7 +169,7 @@ spec:
spec:
containers:
- name: ballista-executor
- image: <your-repo>/datafusion-ballista-executor:0.12.0
+ image: <your-repo>/datafusion-ballista-executor:latest
args:
- "--bind-port=50051"
- "--scheduler-host=ballista-scheduler"
@@ -208,13 +208,13 @@ ballista-executor-78cc5b6486-7crdm 0/1 Pending 0 42s
ballista-scheduler-879f874c5-rnbd6 0/1 Pending 0 42s
```
-You can view the scheduler logs with `kubectl logs ballista-scheduler-0`:
+You can view the scheduler logs with `kubectl logs ballista-scheduler-<pod-id>`:
```
-$ kubectl logs ballista-scheduler-0
-[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050
-[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
-[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
+$ kubectl logs ballista-scheduler-<pod-id>
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
```
## Port Forwarding
diff --git a/docs/source/user-guide/python.md b/docs/source/user-guide/python.md
index f17ac68d..e7d13968 100644
--- a/docs/source/user-guide/python.md
+++ b/docs/source/user-guide/python.md
@@ -117,12 +117,8 @@ The following example demonstrates creating arrays with PyArrow and then creatin
from ballista import BallistaBuilder
import pyarrow
-# an alias
-# TODO implement Functions
-f = ballista.functions
-
# create a context
-ctx = Ballista().standalone()
+ctx = BallistaBuilder().standalone()
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
@@ -132,9 +128,10 @@ batch = pyarrow.RecordBatch.from_arrays(
df = ctx.create_dataframe([[batch]])
# create a new statement
+from datafusion import col
df = df.select(
- f.col("a") + f.col("b"),
- f.col("a") - f.col("b"),
+ col("a") + col("b"),
+ col("a") - col("b"),
)
# execute and collect the first (and only) batch
diff --git a/docs/source/user-guide/tuning-guide.md b/docs/source/user-guide/tuning-guide.md
index 22955b44..fe4363df 100644
--- a/docs/source/user-guide/tuning-guide.md
+++ b/docs/source/user-guide/tuning-guide.md
@@ -73,8 +73,8 @@ which is the best for your use case.
Pull-based scheduling works in a similar way to Apache Spark and push-based scheduling can result in lower latency.
-The scheduling policy can be specified in the `--scheduler_policy` parameter when starting the scheduler and executor
-processes. The default is `pull-based`.
+The scheduling policy can be specified in the `--scheduler-policy` parameter when starting the scheduler and executor
+processes. The default is `pull-staged`.
## Viewing Query Plans and Metrics
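Taken together, the flag changes in this commit affect how a scheduler is launched. A minimal sketch of the updated invocation, combining the corrected `--scheduler-policy` spelling from the tuning guide with the `--task-distribution` option from `configs.md` (the binary path is illustrative and assumes a local v51.0.0 build on `$PATH` or in the working directory):

```shell
# Illustrative only: flags taken from the documentation fixes in this commit.
# --executor-slots-policy is gone; task placement is now controlled by
# --task-distribution with possible values bias, round-robin, consistent-hash.
./ballista-scheduler \
  --scheduler-policy push-staged \
  --event-loop-buffer-size 1000000 \
  --task-distribution round-robin
```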
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]