This is an automated email from the ASF dual-hosted git repository.
milenkovicm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-ballista.git
The following commit(s) were added to refs/heads/main by this push:
new 5ec9ae6b docs: fix outdated content in documentation (#1385)
5ec9ae6b is described below
commit 5ec9ae6b0b5855e7744aa93e6a9acfee25f5ceab
Author: Andy Grove <[email protected]>
AuthorDate: Sat Jan 17 12:18:56 2026 -0700
docs: fix outdated content in documentation (#1385)
* docs: fix outdated content in documentation
- Remove outdated etcd references (etcd backend was removed)
- Update version numbers from old versions to v51.0.0
- Fix executor-slots-policy to task-distribution with correct values
- Remove Sled database references from docker.md
- Update kubernetes.md docker tags and log output format
- Fix Python API: Ballista() -> BallistaBuilder()
- Fix scheduler-policy parameter name and default value
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* chore: add CLAUDE.md to .gitignore
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* style: format markdown with prettier
Co-Authored-By: Claude Opus 4.5 <[email protected]>
* docs: add benchmarking section to contributors guide
Link to benchmarks/README.md for TPC-H and performance testing instructions.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>
---
.gitignore | 3 +++
docs/developer/architecture.md | 2 +-
docs/source/contributors-guide/architecture.md | 3 ---
docs/source/contributors-guide/development.md | 11 +++++++++++
docs/source/user-guide/cli.md | 2 +-
docs/source/user-guide/configs.md | 17 ++++++++---------
.../source/user-guide/deployment/docker-compose.md | 10 +++-------
docs/source/user-guide/deployment/docker.md | 21 +++++++++------------
docs/source/user-guide/deployment/kubernetes.md | 22 +++++++++++-----------
docs/source/user-guide/python.md | 11 ++++-------
docs/source/user-guide/tuning-guide.md | 4 ++--
11 files changed, 53 insertions(+), 53 deletions(-)
diff --git a/.gitignore b/.gitignore
index db1b4fe6..9552147c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -102,3 +102,6 @@ dev/dist
# logs
logs/
+
+# Claude Code guidance file (local only)
+CLAUDE.md
diff --git a/docs/developer/architecture.md b/docs/developer/architecture.md
index d49e5013..3e00346b 100644
--- a/docs/developer/architecture.md
+++ b/docs/developer/architecture.md
@@ -60,7 +60,7 @@ The scheduler process implements a gRPC interface (defined in
| GetJobStatus | Get the status of a submitted query |
| RegisterExecutor | Executors call this method to register themselves with the scheduler |
-The scheduler can run in standalone mode, or can be run in clustered mode using etcd as backing store for state.
+The scheduler currently uses in-memory state storage.
## Executor Process
diff --git a/docs/source/contributors-guide/architecture.md b/docs/source/contributors-guide/architecture.md
index b0e00730..b25321d0 100644
--- a/docs/source/contributors-guide/architecture.md
+++ b/docs/source/contributors-guide/architecture.md
@@ -80,9 +80,6 @@ plan or a SQL string. The scheduler then creates an execution graph, which conta
stages (pipelines) that can be scheduled independently. This process is explained in detail in the Distributed
Query Scheduling section of this guide.
-It is possible to have multiple schedulers running with shared state in etcd, so that jobs can continue to run
-even if a scheduler process fails.
-
### Executor
The executor processes connect to a scheduler and poll for tasks to perform. These tasks are physical plans in
diff --git a/docs/source/contributors-guide/development.md b/docs/source/contributors-guide/development.md
index a21595b1..feefc8a6 100644
--- a/docs/source/contributors-guide/development.md
+++ b/docs/source/contributors-guide/development.md
@@ -65,3 +65,14 @@ cargo test
cd examples
cargo run --example standalone_sql --features=ballista/standalone
```
+
+## Benchmarking
+
+For performance testing and benchmarking with TPC-H and other datasets, see the [benchmarks README](../../../benchmarks/README.md).
+
+This includes instructions for:
+
+- Generating TPC-H test data
+- Running benchmarks against DataFusion and Ballista
+- Comparing performance with Apache Spark
+- Running load tests
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
index 213f6034..597bc195 100644
--- a/docs/source/user-guide/cli.md
+++ b/docs/source/user-guide/cli.md
@@ -71,7 +71,7 @@ It is also possible to run the CLI in standalone mode, where it will create a sc
```bash
$ ballista-cli
-Ballista CLI v8.0.0
+Ballista CLI v51.0.0
> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index 5e909e15..56b847e1 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -96,14 +96,13 @@ manage the whole cluster are also needed to be taken care of.
_Example: Specifying configuration options when starting the scheduler_
```shell
-./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --executor-slots-policy
-round-robin-local
+./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --task-distribution round-robin
```
-| key                                          | type   | default     | description                                                                                                                                                                     |
-| -------------------------------------------- | ------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| scheduler-policy                             | Utf8   | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged.                                                                                   |
-| event-loop-buffer-size                       | UInt32 | 10000       | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended.                                                                   |
-| executor-slots-policy                        | Utf8   | bias        | Sets the executor slots policy for the scheduler, possible values: bias, round-robin, round-robin-local. For a cluster with single scheduler, round-robin-local is recommended. |
-| finished-job-data-clean-up-interval-seconds  | UInt64 | 300         | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled.                                                      |
-| finished-job-state-clean-up-interval-seconds | UInt64 | 3600        | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled.                                                        |
+| key                                          | type   | default     | description                                                                                                                |
+| -------------------------------------------- | ------ | ----------- | -------------------------------------------------------------------------------------------------------------------------- |
+| scheduler-policy                             | Utf8   | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged.                              |
+| event-loop-buffer-size                       | UInt32 | 10000       | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended.              |
+| task-distribution                            | Utf8   | bias        | Sets the task distribution policy for the scheduler, possible values: bias, round-robin, consistent-hash.                  |
+| finished-job-data-clean-up-interval-seconds  | UInt64 | 300         | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled. |
+| finished-job-state-clean-up-interval-seconds | UInt64 | 3600        | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled.   |
diff --git a/docs/source/user-guide/deployment/docker-compose.md b/docs/source/user-guide/deployment/docker-compose.md
index 67f40b7a..96f69d69 100644
--- a/docs/source/user-guide/deployment/docker-compose.md
+++ b/docs/source/user-guide/deployment/docker-compose.md
@@ -39,15 +39,11 @@ This should show output similar to the following:
```bash
$ docker-compose up
Creating network "ballista-benchmarks_default" with the default driver
-Creating ballista-benchmarks_etcd_1 ... done
Creating ballista-benchmarks_ballista-scheduler_1 ... done
Creating ballista-benchmarks_ballista-executor_1 ... done
-Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Running with config:
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] work_dir: /tmp/.tmpLVx39c
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] concurrent_tasks: 4
-ballista-scheduler_1 | [2021-08-28T15:55:22Z INFO ballista_scheduler] Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Ballista v0.12.0 Rust Executor listening on 0.0.0.0:50051
+Attaching to ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
+ballista-scheduler_1 | INFO ballista_scheduler: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+ballista-executor_1 | INFO ballista_executor: Ballista v51.0.0 Rust Executor listening on 0.0.0.0:50051
```
The scheduler listens on port 50050 and this is the port that clients will need to connect to.
diff --git a/docs/source/user-guide/deployment/docker.md b/docs/source/user-guide/deployment/docker.md
index a0542377..cf5488a6 100644
--- a/docs/source/user-guide/deployment/docker.md
+++ b/docs/source/user-guide/deployment/docker.md
@@ -67,13 +67,10 @@ Run `docker logs CONTAINER_ID` to check the output from the process:
```
$ docker logs a756055576f3
-2024-02-03T14:49:47.904571Z INFO main ThreadId(01) ballista_scheduler::cluster: Initializing Sled database in temp directory
-
-2024-02-03T14:49:47.924679Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-2024-02-03T14:49:47.924709Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
-2024-02-03T14:49:47.925261Z INFO main ThreadId(01) ballista_scheduler::cluster::kv: Initializing heartbeat listener
-2024-02-03T14:49:47.925476Z INFO main ThreadId(01) ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
-2024-02-03T14:49:47.925587Z INFO tokio-runtime-worker ThreadId(47) ballista_core::event_loop: Starting the event loop query_stage
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
+INFO ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
+INFO ballista_core::event_loop: Starting the event loop query_stage
```
### Start Executors
@@ -99,11 +96,11 @@ Use `docker logs CONTAINER_ID` to check the output from the executor(s):
```
$ docker logs fb8b530cee6d
-2024-02-03T14:50:24.061607Z INFO main ThreadId(01) ballista_executor::executor_process: Running with config:
-2024-02-03T14:50:24.061649Z INFO main ThreadId(01) ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
-2024-02-03T14:50:24.061655Z INFO main ThreadId(01) ballista_executor::executor_process: concurrent_tasks: 48
-2024-02-03T14:50:24.063256Z INFO tokio-runtime-worker ThreadId(44) ballista_executor::executor_process: Ballista v0.12.0 Rust Executor Flight Server listening on 0.0.0.0:50051
-2024-02-03T14:50:24.063281Z INFO tokio-runtime-worker ThreadId(47) ballista_executor::execution_loop: Starting poll work loop with scheduler
+INFO ballista_executor::executor_process: Running with config:
+INFO ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
+INFO ballista_executor::executor_process: concurrent_tasks: 48
+INFO ballista_executor::executor_process: Ballista v51.0.0 Rust Executor Flight Server listening on 0.0.0.0:50051
+INFO ballista_executor::execution_loop: Starting poll work loop with scheduler
```
## Connect from the CLI
diff --git a/docs/source/user-guide/deployment/kubernetes.md b/docs/source/user-guide/deployment/kubernetes.md
index d3062ed8..3e25d1f9 100644
--- a/docs/source/user-guide/deployment/kubernetes.md
+++ b/docs/source/user-guide/deployment/kubernetes.md
@@ -48,10 +48,10 @@ To create the required Docker images please refer to the [docker deployment page
Once the images have been built, you can retag them and can push them to your favourite Docker registry.
```bash
-docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
-docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker push <your-repo>/datafusion-ballista-executor:0.12.0
+docker tag apache/datafusion-ballista-scheduler:latest <your-repo>/datafusion-ballista-scheduler:latest
+docker tag apache/datafusion-ballista-executor:latest <your-repo>/datafusion-ballista-executor:latest
+docker push <your-repo>/datafusion-ballista-scheduler:latest
+docker push <your-repo>/datafusion-ballista-executor:latest
```
## Create Persistent Volume and Persistent Volume Claim
@@ -139,7 +139,7 @@ spec:
spec:
containers:
- name: ballista-scheduler
- image: <your-repo>/datafusion-ballista-scheduler:0.12.0
+ image: <your-repo>/datafusion-ballista-scheduler:latest
args: ["--bind-port=50050"]
ports:
- containerPort: 50050
@@ -169,7 +169,7 @@ spec:
spec:
containers:
- name: ballista-executor
- image: <your-repo>/datafusion-ballista-executor:0.12.0
+ image: <your-repo>/datafusion-ballista-executor:latest
args:
- "--bind-port=50051"
- "--scheduler-host=ballista-scheduler"
@@ -208,13 +208,13 @@ ballista-executor-78cc5b6486-7crdm 0/1 Pending 0 42s
ballista-scheduler-879f874c5-rnbd6 0/1 Pending 0 42s
```
-You can view the scheduler logs with `kubectl logs ballista-scheduler-0`:
+You can view the scheduler logs with `kubectl logs ballista-scheduler-<pod-id>`:
```
-$ kubectl logs ballista-scheduler-0
-[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050
-[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
-[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
+$ kubectl logs ballista-scheduler-<pod-id>
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
```
## Port Forwarding
diff --git a/docs/source/user-guide/python.md b/docs/source/user-guide/python.md
index f17ac68d..e7d13968 100644
--- a/docs/source/user-guide/python.md
+++ b/docs/source/user-guide/python.md
@@ -117,12 +117,8 @@ The following example demonstrates creating arrays with PyArrow and then creatin
from ballista import BallistaBuilder
import pyarrow
-# an alias
-# TODO implement Functions
-f = ballista.functions
-
# create a context
-ctx = Ballista().standalone()
+ctx = BallistaBuilder().standalone()
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
@@ -132,9 +128,10 @@ batch = pyarrow.RecordBatch.from_arrays(
df = ctx.create_dataframe([[batch]])
# create a new statement
+from datafusion import col
df = df.select(
- f.col("a") + f.col("b"),
- f.col("a") - f.col("b"),
+ col("a") + col("b"),
+ col("a") - col("b"),
)
# execute and collect the first (and only) batch
diff --git a/docs/source/user-guide/tuning-guide.md b/docs/source/user-guide/tuning-guide.md
index 22955b44..fe4363df 100644
--- a/docs/source/user-guide/tuning-guide.md
+++ b/docs/source/user-guide/tuning-guide.md
@@ -73,8 +73,8 @@ which is the best for your use case.
Pull-based scheduling works in a similar way to Apache Spark and push-based scheduling can result in lower latency.
-The scheduling policy can be specified in the `--scheduler_policy` parameter when starting the scheduler and executor
-processes. The default is `pull-based`.
+The scheduling policy can be specified in the `--scheduler-policy` parameter when starting the scheduler and executor
+processes. The default is `pull-staged`.
## Viewing Query Plans and Metrics
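Taken together, the flag changes in this commit affect how a scheduler is launched. A minimal sketch of the updated invocation, combining the corrected `--scheduler-policy` spelling from the tuning guide with the `--task-distribution` option from `configs.md` (the binary path is illustrative and assumes a local v51.0.0 build on `$PATH` or in the working directory):

```shell
# Illustrative only: flags taken from the documentation fixes in this commit.
# --executor-slots-policy is gone; task placement is now controlled by
# --task-distribution with possible values bias, round-robin, consistent-hash.
./ballista-scheduler \
  --scheduler-policy push-staged \
  --event-loop-buffer-size 1000000 \
  --task-distribution round-robin
```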
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]