This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 90774daf User Guide improvements (#274)
90774daf is described below
commit 90774daf44c2c189f1b1d89ff316bdcb2e3757ad
Author: Andy Grove <[email protected]>
AuthorDate: Thu Sep 29 07:54:16 2022 -0600
User Guide improvements (#274)
---
docs/README.md | 38 +++++++-----
docs/build.sh | 21 +++++++
.../deployment => developer}/configuration.md | 0
docs/source/index.rst | 27 ++++++--
docs/source/user-guide/cli.md | 69 +++++++++------------
.../source/user-guide/deployment/docker-compose.md | 4 +-
docs/source/user-guide/deployment/docker.md | 4 +-
docs/source/user-guide/deployment/index.rst | 9 ++-
docs/source/user-guide/flightsql.md | 2 +-
docs/source/user-guide/images/ballista-web-ui.png | Bin 0 -> 234939 bytes
.../user-guide/images/example-query-plan.png | Bin 0 -> 192455 bytes
docs/source/user-guide/introduction.md | 29 +++++----
docs/source/user-guide/python.md | 47 +++++++++++++-
docs/source/user-guide/rust.md | 5 +-
docs/source/user-guide/scheduler.md | 36 +++++++++++
docs/source/user-guide/tuning-guide.md | 21 +++++++
16 files changed, 220 insertions(+), 92 deletions(-)
diff --git a/docs/README.md b/docs/README.md
index 920ba494..9813be4b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,16 +17,20 @@
under the License.
-->
-# Developer Documentation
+# Ballista Documentation
-Developer documentation can be found [here](developer/README.md).
-User documentation can be found [here](source/user-guide/introduction.md).
+## User Documentation
+
+Documentation for the current published release can be found at
https://arrow.apache.org/ballista and the source
+content is located [here](source/user-guide/introduction.md).
-# User Documentation
+## Developer Documentation
+
+Developer documentation can be found [here](developer/README.md).
-_These instructions were forked from the `arrow-datafusion` repository and are
outdated_
+## Building the User Guide
-## Dependencies
+### Dependencies
It's recommended to install build dependencies and build the documentation
inside a Python virtualenv.
@@ -38,21 +42,21 @@ inside a Python virtualenv.
## Build
```bash
-make html
+./build.sh
```
## Release
-The documentation is served through the
-[arrow-site](https://github.com/apache/arrow-site/) repo. To release a new
-version of the docs, follow these steps:
+The documentation is served through the
[arrow-site](https://github.com/apache/arrow-site/) repository. To release
+a new version of the documentation, follow these steps:
-1. Run `make html` inside `docs` folder to generate the docs website inside
the `build/html` folder.
-2. Clone the arrow-site repo
-3. Checkout to the `asf-site` branch (NOT `master`)
-4. Copy build artifacts into `arrow-site` repo's `datafusion` folder with a
command such as
+1. Download the release source tarball (we can only publish documentation from
official releases)
+2. Run `./build.sh` inside `docs` folder to generate the docs website inside
the `build/html` folder.
+3. Clone the arrow-site repo
+4. Checkout to the `asf-site` branch (NOT `master`)
+5. Copy build artifacts into `arrow-site` repo's `ballista` folder with a
command such as
-- `cp -rT ./build/html/ ../../arrow-site/datafusion/` (doesn't work on mac)
-- `rsync -avzr ./build/html/ ../../arrow-site/datafusion/`
+- `cp -rT ./build/html/ ../../arrow-site/ballista/` (doesn't work on mac)
+- `rsync -avzr ./build/html/ ../../arrow-site/ballista/`
-5. Commit changes in `arrow-site` and send a PR.
+6. Commit changes in `arrow-site` and send a PR.
diff --git a/docs/build.sh b/docs/build.sh
new file mode 100755
index 00000000..ba25a933
--- /dev/null
+++ b/docs/build.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+rm -rf build
+make html
diff --git a/docs/source/user-guide/deployment/configuration.md
b/docs/developer/configuration.md
similarity index 100%
rename from docs/source/user-guide/deployment/configuration.md
rename to docs/developer/configuration.md
diff --git a/docs/source/index.rst b/docs/source/index.rst
index a9343b1d..270c475e 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,11 +29,28 @@ Table of content
:maxdepth: 1
:caption: User Guide
- user-guide/introduction
- user-guide/deployment/index
- user-guide/python
- user-guide/rust
- user-guide/cli
+ Introduction <user-guide/introduction>
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Cluster Deployment
+
+ Deployment <user-guide/deployment/index>
+ Scheduler <user-guide/scheduler>
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Clients
+
+ Python <user-guide/python>
+ Rust <user-guide/rust>
+ Flight SQL JDBC <user-guide/flightsql>
+ SQL CLI <user-guide/cli>
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Reference
+
user-guide/configs
user-guide/tuning-guide
user-guide/faq
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
index 2ec9c8d4..f8bc0805 100644
--- a/docs/source/user-guide/cli.md
+++ b/docs/source/user-guide/cli.md
@@ -17,27 +17,35 @@
under the License.
-->
-# DataFusion Command-line Interface
+# Ballista Command-line Interface
-The DataFusion CLI allows SQL queries to be executed by an in-process
DataFusion context, or by a distributed
-Ballista context.
+The Ballista CLI allows SQL queries to be executed against a Ballista cluster,
or in standalone mode in a single
+process.
+Use Cargo to install:
+
+```bash
+cargo install ballista-cli
```
-USAGE:
- datafusion-cli [FLAGS] [OPTIONS]
-FLAGS:
- -h, --help Prints help information
- -q, --quiet Reduce printing other than the results and work quietly
- -V, --version Prints version information
+## Usage
+
+```
+USAGE:
+ ballista-cli [OPTIONS]
OPTIONS:
-    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
-    -p, --data-path <data-path>      Path to your data, default to current directory
-    -f, --file <file>...             Execute commands from file(s), then exit
-        --format <format>            Output format [default: table] [possible values: csv, tsv, table, json, ndjson]
-        --host <host>                Ballista scheduler host
-        --port <port>                Ballista scheduler port
+    -c, --batch-size <BATCH_SIZE>    The batch size of each query, or use DataFusion default
+    -f, --file <FILE>...             Execute commands from file(s), then exit
+        --format <FORMAT>            [default: table] [possible values: csv, tsv, table, json,
+                                     nd-json]
+    -h, --help                       Print help information
+        --host <HOST>                Ballista scheduler host
+    -p, --data-path <DATA_PATH>      Path to your data, default to current directory
+        --port <PORT>                Ballista scheduler port
+    -q, --quiet                      Reduce printing other than the results and work quietly
+    -r, --rc <RC>...                 Run the provided files on startup instead of ~/.datafusionrc
+    -V, --version                    Print version information
```
## Example
@@ -48,10 +56,12 @@ Create a CSV file to query.
$ echo "1,2" > data.csv
```
+## Run Ballista CLI in Standalone Mode
+
```bash
-$ datafusion-cli
+$ ballista-cli
-DataFusion CLI v8.0.0
+Ballista CLI v8.0.0
> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.
@@ -65,28 +75,9 @@ DataFusion CLI v8.0.0
1 row in set. Query took 0.017 seconds.
```
-## DataFusion-Cli
-
-Build the `datafusion-cli` without the feature of ballista.
-
-```bash
-cd arrow-datafusion/datafusion-cli
-cargo build
-```
-
-## Ballista
-
-The DataFusion CLI can also connect to a Ballista scheduler for query
execution.
-
-Before you use the `datafusion-cli` to connect the Ballista scheduler, you
should build/compile
-the `datafusion-cli` with feature of "ballista" first.
-
-```bash
-cd arrow-datafusion/datafusion-cli
-cargo build --features ballista
-```
+## Run Ballista CLI in Distributed Mode
-Then, you can connect the Ballista by below command.
+The CLI can also connect to a Ballista scheduler for query execution.
```bash
datafusion-cli --host localhost --port 50050
@@ -94,7 +85,7 @@ datafusion-cli --host localhost --port 50050
## Cli commands
-Available commands inside DataFusion CLI are:
+Available commands inside Ballista CLI are:
- Quit
diff --git a/docs/source/user-guide/deployment/docker-compose.md
b/docs/source/user-guide/deployment/docker-compose.md
index 1a2f374a..8e092ac5 100644
--- a/docs/source/user-guide/deployment/docker-compose.md
+++ b/docs/source/user-guide/deployment/docker-compose.md
@@ -28,8 +28,8 @@ There is no officially published Docker image so it is
currently necessary to bu
Run the following commands to clone the source repository and build the Docker
image.
```bash
-git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
-cd arrow-datafusion
+git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
+cd arrow-ballista
./dev/build-ballista-docker.sh
```
diff --git a/docs/source/user-guide/deployment/docker.md
b/docs/source/user-guide/deployment/docker.md
index c6188aae..f5de816e 100644
--- a/docs/source/user-guide/deployment/docker.md
+++ b/docs/source/user-guide/deployment/docker.md
@@ -26,8 +26,8 @@ There is no officially published Docker image so it is
currently necessary to bu
Run the following commands to clone the source repository and build the Docker
image.
```bash
-git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
-cd arrow-datafusion
+git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
+cd arrow-ballista
./dev/build-ballista-docker.sh
```
diff --git a/docs/source/user-guide/deployment/index.rst
b/docs/source/user-guide/deployment/index.rst
index ad9c0714..73b3a8e3 100644
--- a/docs/source/user-guide/deployment/index.rst
+++ b/docs/source/user-guide/deployment/index.rst
@@ -21,8 +21,7 @@ Start a Ballista Cluster
.. toctree::
:maxdepth: 2
- cargo-install
- docker
- docker-compose
- kubernetes
- configuration
+ Cargo <cargo-install>
+ Docker <docker>
+ Docker Compose <docker-compose>
+ Kubernetes <kubernetes>
diff --git a/docs/source/user-guide/flightsql.md
b/docs/source/user-guide/flightsql.md
index cd7b9fce..57410642 100644
--- a/docs/source/user-guide/flightsql.md
+++ b/docs/source/user-guide/flightsql.md
@@ -117,7 +117,7 @@ To register a table, find a `.csv`, `.json`, or `.parquet`
file for testing, and
```sql
create external table customer stored as CSV with header row
-  location '/home/username/arrow-datafusion/datafusion/core/tests/tpch-csv/customer.csv';
+ location '/path/to/customer.csv';
```
Once the table has been registered, all the normal SQL queries can be
performed:
diff --git a/docs/source/user-guide/images/ballista-web-ui.png
b/docs/source/user-guide/images/ballista-web-ui.png
new file mode 100644
index 00000000..9784d144
Binary files /dev/null and b/docs/source/user-guide/images/ballista-web-ui.png
differ
diff --git a/docs/source/user-guide/images/example-query-plan.png
b/docs/source/user-guide/images/example-query-plan.png
new file mode 100644
index 00000000..6ead4e62
Binary files /dev/null and
b/docs/source/user-guide/images/example-query-plan.png differ
diff --git a/docs/source/user-guide/introduction.md
b/docs/source/user-guide/introduction.md
index db7d0e13..65cbe2f7 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -19,16 +19,19 @@
# Overview
-Ballista is a distributed compute platform primarily implemented in Rust, and
powered by Apache Arrow. It is
-built on an architecture that allows other programming languages to be
supported as first-class citizens without paying
-a penalty for serialization costs.
+Ballista is a distributed compute platform primarily implemented in Rust, and
powered by Apache Arrow.
-The foundational technologies in Ballista are:
+Ballista has a scheduler and an executor process that are standard Rust executables and can be executed directly, but
+Dockerfiles are provided to build images for use in containerized environments, such as Docker, Docker Compose, and
+Kubernetes. See the [deployment guide](deployment.md) for more information.
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels
for efficient processing of data.
-- [Apache Arrow Flight
Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
for efficient data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers)
for serializing query plans.
-- [DataFusion](https://github.com/apache/arrow-datafusion/) for query
execution.
+SQL and DataFrame queries can be submitted from Python and Rust, and SQL
queries can be submitted via the Arrow
+Flight SQL JDBC driver, supporting your favorite JDBC compliant tools such as
[DataGrip](datagrip)
+or [tableau](tableau). For setup instructions, please see the [FlightSQL
guide](flightsql.md).
+
+The scheduler has a web user interface for monitoring query status as well as
a REST API.
+
+
## How does this compare to Apache Spark?
@@ -45,10 +48,6 @@ Although Ballista is largely inspired by Apache Spark, there
are some key differ
- The use of Apache Arrow as the memory model and network protocol means that
data can be exchanged between executors
in any programming language with minimal serialization overhead.
-## Status
-
-Ballista is still in the early stages of development but is capable of
executing complex analytical queries at scale.
-
-## Usage
-
-Ballista can be used from your favorite JDBC compliant tools such as
[DataGrip](https://www.jetbrains.com/datagrip/) or
[tableau](https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm).
For setup instructions, please see the [FlightSQL guide](flightsql.md).
+[deployment](./deployment)
+[datagrip](https://www.jetbrains.com/datagrip/)
+[tableau](https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm)
diff --git a/docs/source/user-guide/python.md b/docs/source/user-guide/python.md
index 3bd4fe50..4a907566 100644
--- a/docs/source/user-guide/python.md
+++ b/docs/source/user-guide/python.md
@@ -21,6 +21,9 @@
Ballista provides Python bindings, allowing SQL and DataFrame queries to be
executed from the Python shell.
+Like PySpark, it allows you to build a plan through SQL or a DataFrame API against Parquet, CSV, JSON, and other
+popular file formats, run it in a distributed environment, and obtain the result back in Python.
+
## Connecting to a Cluster
The following code demonstrates how to create a Ballista context and connect
to a scheduler.
@@ -30,7 +33,13 @@ The following code demonstrates how to create a Ballista
context and connect to
>>> ctx = ballista.BallistaContext("localhost", 50050)
```
-## Registering Tables
+## SQL
+
+The Python bindings support executing SQL queries as well.
+
+### Registering Tables
+
+Before SQL queries can be executed, tables need to be registered with the
context.
Tables can be registered against the context by calling one of the `register`
methods, or by executing SQL.
@@ -42,7 +51,7 @@ Tables can be registered against the context by calling one
of the `register` me
>>> ctx.sql("CREATE EXTERNAL TABLE trips STORED AS PARQUET LOCATION '/mnt/bigdata/nyctaxi'")
```
-## Executing Queries
+### Executing Queries
The `sql` method creates a `DataFrame`. The query is executed when an action
such as `show` or `collect` is executed.
@@ -88,3 +97,37 @@ The `explain` method can be used to show the logical and
physical query plans fo
| | |
+---------------+-------------------------------------------------------------+
```
+
+## DataFrame
+
+The following example demonstrates creating arrays with PyArrow and then
creating a Ballista DataFrame.
+
+```python
+import ballista
+import pyarrow
+
+# an alias
+f = ballista.functions
+
+# create a context
+ctx = ballista.BallistaContext("localhost", 50050)
+
+# create a RecordBatch and a new DataFrame from it
+batch = pyarrow.RecordBatch.from_arrays(
+ [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
+ names=["a", "b"],
+)
+df = ctx.create_dataframe([[batch]])
+
+# create a new statement
+df = df.select(
+ f.col("a") + f.col("b"),
+ f.col("a") - f.col("b"),
+)
+
+# execute and collect the first (and only) batch
+result = df.collect()[0]
+
+assert result.column(0) == pyarrow.array([5, 7, 9])
+assert result.column(1) == pyarrow.array([-3, -3, -3])
+```
diff --git a/docs/source/user-guide/rust.md b/docs/source/user-guide/rust.md
index 8788d058..e3015e76 100644
--- a/docs/source/user-guide/rust.md
+++ b/docs/source/user-guide/rust.md
@@ -19,10 +19,7 @@
# Ballista Rust Client
-Ballista usage is very similar to DataFusion. Tha main difference is that the
starting point is a `BallistaContext`
-instead of the DataFusion `SessionContext`. Ballista uses the same DataFrame
API as DataFusion.
-
-The following code sample demonstrates how to create a `BallistaContext` to
connect to a Ballista scheduler process.
+To connect to a Ballista cluster from Rust, first start by creating a
`BallistaContext`.
```rust
let config = BallistaConfig::builder()
diff --git a/docs/source/user-guide/scheduler.md
b/docs/source/user-guide/scheduler.md
new file mode 100644
index 00000000..21d8f876
--- /dev/null
+++ b/docs/source/user-guide/scheduler.md
@@ -0,0 +1,36 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Ballista Scheduler
+
+## Web User Interface
+
+The scheduler provides a web user interface that allows queries to be
monitored.
+
+
+
+## REST API
+
+The scheduler also provides a REST API that allows jobs to be monitored.
+
+| API                   | Description                                                 |
+| --------------------- | ----------------------------------------------------------- |
+| /api/jobs             | Get a list of jobs that have been submitted to the cluster. |
+| /api/job/{job_id}     | Get a summary of a submitted job.                           |
+| /api/job/{job_id}/dot | Produce a query plan in DOT (graphviz) format.              |
diff --git a/docs/source/user-guide/tuning-guide.md
b/docs/source/user-guide/tuning-guide.md
index 3d270051..6b850e9b 100644
--- a/docs/source/user-guide/tuning-guide.md
+++ b/docs/source/user-guide/tuning-guide.md
@@ -69,3 +69,24 @@ Pull-based scheduling works in a similar way to Apache Spark
and push-based sche
The scheduling policy can be specified in the `--scheduler_policy` parameter
when starting the scheduler and executor
processes. The default is `pull-based`.
+
+## Viewing Query Plans and Metrics
+
+The scheduler provides a web user interface as well as a REST API for
monitoring jobs. See the
+[scheduler documentation](scheduler.md) for more information.
+
+To download a query plan in dot format from the scheduler, submit a request to
the following API endpoint
+
+```
+http://localhost:50050/api/job/{job_id}/dot
+```
+
+The resulting file can be converted into an image using `graphviz`:
+
+```bash
+dot -Tpng query.dot > query.png
+```
+
+Here is an example query plan:
+
+
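
The DOT download described in the tuning-guide hunk above could be scripted roughly as follows. This is a minimal sketch, not part of the commit: it assumes the scheduler's REST API is reachable at `localhost:50050` as in the documented example, and the helper names `job_dot_url` and `download_dot` are hypothetical. Only the `/api/job/{job_id}/dot` path is taken from the documentation.

```python
from urllib.parse import quote
from urllib.request import urlopen


def job_dot_url(job_id, host="localhost", port=50050):
    """Build the URL of the documented DOT endpoint for one job.

    The job ID is percent-encoded so that it is always a single path segment.
    """
    return f"http://{host}:{port}/api/job/{quote(job_id, safe='')}/dot"


def download_dot(job_id, out_path="query.dot", host="localhost", port=50050):
    """Fetch the query plan in DOT format and write it to out_path.

    Hypothetical helper: it assumes the endpoint returns the DOT text
    as the response body of a plain GET request.
    """
    with urlopen(job_dot_url(job_id, host, port), timeout=10) as response:
        data = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(data)
    return out_path


# Example (requires a running scheduler; "abc123" is a made-up job ID):
# download_dot("abc123")   # then: dot -Tpng query.dot > query.png
```

The saved file can then be passed to `dot -Tpng` exactly as shown in the tuning guide.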