[arrow-ballista] branch master updated: User Guide improvements (#274)

agrove Thu, 29 Sep 2022 06:55:47 -0700

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git



The following commit(s) were added to refs/heads/master by this push:
     new 90774daf User Guide improvements (#274)
90774daf is described below

commit 90774daf44c2c189f1b1d89ff316bdcb2e3757ad
Author: Andy Grove <[email protected]>
AuthorDate: Thu Sep 29 07:54:16 2022 -0600

    User Guide improvements (#274)
---
 docs/README.md                                     |  38 +++++++-----
 docs/build.sh                                      |  21 +++++++
 .../deployment => developer}/configuration.md      |   0
 docs/source/index.rst                              |  27 ++++++--
 docs/source/user-guide/cli.md                      |  69 +++++++++------------
 .../source/user-guide/deployment/docker-compose.md |   4 +-
 docs/source/user-guide/deployment/docker.md        |   4 +-
 docs/source/user-guide/deployment/index.rst        |   9 ++-
 docs/source/user-guide/flightsql.md                |   2 +-
 docs/source/user-guide/images/ballista-web-ui.png  | Bin 0 -> 234939 bytes
 .../user-guide/images/example-query-plan.png       | Bin 0 -> 192455 bytes
 docs/source/user-guide/introduction.md             |  29 +++++----
 docs/source/user-guide/python.md                   |  47 +++++++++++++-
 docs/source/user-guide/rust.md                     |   5 +-
 docs/source/user-guide/scheduler.md                |  36 +++++++++++
 docs/source/user-guide/tuning-guide.md             |  21 +++++++
 16 files changed, 220 insertions(+), 92 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 920ba494..9813be4b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,16 +17,20 @@
   under the License.
 -->
 
-# Developer Documentation
+# Ballista Documentation
 
-Developer documentation can be found [here](developer/README.md).
-User documentation can be found [here](source/user-guide/introduction.md).
+## User Documentation
+
+Documentation for the current published release can be found at https://arrow.apache.org/ballista and the source
+content is located [here](source/user-guide/introduction.md).
 
-# User Documentation
+## Developer Documentation
+
+Developer documentation can be found [here](developer/README.md).
 
-_These instructions were forked from the `arrow-datafusion` repository and are outdated_
+## Building the User Guide
 
-## Dependencies
+### Dependencies
 
 It's recommended to install build dependencies and build the documentation
 inside a Python virtualenv.
@@ -38,21 +42,21 @@ inside a Python virtualenv.
 ## Build
 
 ```bash
-make html
+./build.sh
 ```
 
 ## Release
 
-The documentation is served through the
-[arrow-site](https://github.com/apache/arrow-site/) repo. To release a new
-version of the docs, follow these steps:
+The documentation is served through the [arrow-site](https://github.com/apache/arrow-site/) repository. To release
+a new version of the documentation, follow these steps:
 
-1. Run `make html` inside `docs` folder to generate the docs website inside the `build/html` folder.
-2. Clone the arrow-site repo
-3. Checkout to the `asf-site` branch (NOT `master`)
-4. Copy build artifacts into `arrow-site` repo's `datafusion` folder with a command such as
+1. Download the release source tarball (we can only publish documentation from official releases)
+2. Run `./build.sh` inside `docs` folder to generate the docs website inside the `build/html` folder.
+3. Clone the arrow-site repo
+4. Checkout to the `asf-site` branch (NOT `master`)
+5. Copy build artifacts into `arrow-site` repo's `ballista` folder with a command such as
 
-- `cp -rT ./build/html/ ../../arrow-site/datafusion/` (doesn't work on mac)
-- `rsync -avzr ./build/html/ ../../arrow-site/datafusion/`
+- `cp -rT ./build/html/ ../../arrow-site/ballista/` (doesn't work on mac)
+- `rsync -avzr ./build/html/ ../../arrow-site/ballista/`
 
-5. Commit changes in `arrow-site` and send a PR.
+6. Commit changes in `arrow-site` and send a PR.
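
The copy step in the release instructions above can also be done portably, since `cp -rT` is noted as not working on macOS. The following is a minimal sketch, not part of the commit; the function name `publish_docs` is hypothetical, and the paths are the ones quoted in the README.

```python
import shutil
from pathlib import Path


def publish_docs(build_html: Path, site_dir: Path) -> int:
    """Merge build_html into site_dir and return the number of files now under site_dir."""
    if not build_html.is_dir():
        raise FileNotFoundError(f"{build_html} not found; run ./build.sh first")
    # dirs_exist_ok=True merges into an existing folder, much like `cp -rT`
    # (requires Python 3.8 or newer).
    shutil.copytree(build_html, site_dir, dirs_exist_ok=True)
    return sum(1 for path in site_dir.rglob("*") if path.is_file())


if __name__ == "__main__":
    # Paths as given in the README, relative to the `docs` folder.
    source = Path("build/html")
    target = Path("../../arrow-site/ballista")
    if source.is_dir():
        print(publish_docs(source, target), "files in", target)
```

Unlike `rsync`, this never deletes files that already exist in the target folder, so stale pages would still need to be removed by hand.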
diff --git a/docs/build.sh b/docs/build.sh
new file mode 100755
index 00000000..ba25a933
--- /dev/null
+++ b/docs/build.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+rm -rf build
+make html
diff --git a/docs/source/user-guide/deployment/configuration.md b/docs/developer/configuration.md
similarity index 100%
rename from docs/source/user-guide/deployment/configuration.md
rename to docs/developer/configuration.md
diff --git a/docs/source/index.rst b/docs/source/index.rst
index a9343b1d..270c475e 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,11 +29,28 @@ Table of content
    :maxdepth: 1
    :caption: User Guide
 
-   user-guide/introduction
-   user-guide/deployment/index
-   user-guide/python
-   user-guide/rust
-   user-guide/cli
+   Introduction <user-guide/introduction>
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Cluster Deployment
+
+   Deployment <user-guide/deployment/index>
+   Scheduler <user-guide/scheduler>
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Clients
+
+   Python <user-guide/python>
+   Rust <user-guide/rust>
+   Flight SQL JDBC <user-guide/flightsql>
+   SQL CLI <user-guide/cli>
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Reference
+
    user-guide/configs
    user-guide/tuning-guide
    user-guide/faq
diff --git a/docs/source/user-guide/cli.md b/docs/source/user-guide/cli.md
index 2ec9c8d4..f8bc0805 100644
--- a/docs/source/user-guide/cli.md
+++ b/docs/source/user-guide/cli.md
@@ -17,27 +17,35 @@
   under the License.
 -->
 
-# DataFusion Command-line Interface
+# Ballista Command-line Interface
 
-The DataFusion CLI allows SQL queries to be executed by an in-process DataFusion context, or by a distributed
-Ballista context.
+The Ballista CLI allows SQL queries to be executed against a Ballista cluster, or in standalone mode in a single
+process.
 
+Use Cargo to install:
+
+```bash
+cargo install ballista-cli
 ```
-USAGE:
-    datafusion-cli [FLAGS] [OPTIONS]
 
-FLAGS:
-    -h, --help       Prints help information
-    -q, --quiet      Reduce printing other than the results and work quietly
-    -V, --version    Prints version information
+## Usage
+
+```
+USAGE:
+    ballista-cli [OPTIONS]
 
 OPTIONS:
-    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
-    -p, --data-path <data-path>      Path to your data, default to current directory
-    -f, --file <file>...             Execute commands from file(s), then exit
-        --format <format>            Output format [default: table]  [possible values: csv, tsv, table, json, ndjson]
-        --host <host>                Ballista scheduler host
-        --port <port>                Ballista scheduler port
+    -c, --batch-size <BATCH_SIZE>    The batch size of each query, or use DataFusion default
+    -f, --file <FILE>...             Execute commands from file(s), then exit
+        --format <FORMAT>            [default: table] [possible values: csv, tsv, table, json,
+                                     nd-json]
+    -h, --help                       Print help information
+        --host <HOST>                Ballista scheduler host
+    -p, --data-path <DATA_PATH>      Path to your data, default to current directory
+        --port <PORT>                Ballista scheduler port
+    -q, --quiet                      Reduce printing other than the results and work quietly
+    -r, --rc <RC>...                 Run the provided files on startup instead of ~/.datafusionrc
+    -V, --version                    Print version information
 ```
 
 ## Example
@@ -48,10 +56,12 @@ Create a CSV file to query.
 $ echo "1,2" > data.csv
 ```
 
+## Run Ballista CLI in Standalone Mode
+
 ```bash
-$ datafusion-cli
+$ ballista-cli
 
-DataFusion CLI v8.0.0
+Ballista CLI v8.0.0
 
 > CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
 0 rows in set. Query took 0.001 seconds.
@@ -65,28 +75,9 @@ DataFusion CLI v8.0.0
 1 row in set. Query took 0.017 seconds.
 ```
 
-## DataFusion-Cli
-
-Build the `datafusion-cli` without the feature of ballista.
-
-```bash
-cd arrow-datafusion/datafusion-cli
-cargo build
-```
-
-## Ballista
-
-The DataFusion CLI can also connect to a Ballista scheduler for query execution.
-
-Before you use the `datafusion-cli` to connect the Ballista scheduler, you should build/compile
-the `datafusion-cli` with feature of "ballista" first.
-
-```bash
-cd arrow-datafusion/datafusion-cli
-cargo build --features ballista
-```
+## Run Ballista CLI in Distributed Mode
 
-Then, you can connect the Ballista by below command.
+The CLI can also connect to a Ballista scheduler for query execution.
 
 ```bash
 datafusion-cli --host localhost --port 50050
@@ -94,7 +85,7 @@ datafusion-cli --host localhost --port 50050
 
 ## Cli commands
 
-Available commands inside DataFusion CLI are:
+Available commands inside Ballista CLI are:
 
 - Quit
 
diff --git a/docs/source/user-guide/deployment/docker-compose.md b/docs/source/user-guide/deployment/docker-compose.md
index 1a2f374a..8e092ac5 100644
--- a/docs/source/user-guide/deployment/docker-compose.md
+++ b/docs/source/user-guide/deployment/docker-compose.md
@@ -28,8 +28,8 @@ There is no officially published Docker image so it is currently necessary to bu
 Run the following commands to clone the source repository and build the Docker image.
 
 ```bash
-git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
-cd arrow-datafusion
+git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
+cd arrow-ballista
 ./dev/build-ballista-docker.sh
 ```
 
diff --git a/docs/source/user-guide/deployment/docker.md b/docs/source/user-guide/deployment/docker.md
index c6188aae..f5de816e 100644
--- a/docs/source/user-guide/deployment/docker.md
+++ b/docs/source/user-guide/deployment/docker.md
@@ -26,8 +26,8 @@ There is no officially published Docker image so it is currently necessary to bu
 Run the following commands to clone the source repository and build the Docker image.
 
 ```bash
-git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
-cd arrow-datafusion
+git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
+cd arrow-ballista
 ./dev/build-ballista-docker.sh
 ```
 
diff --git a/docs/source/user-guide/deployment/index.rst b/docs/source/user-guide/deployment/index.rst
index ad9c0714..73b3a8e3 100644
--- a/docs/source/user-guide/deployment/index.rst
+++ b/docs/source/user-guide/deployment/index.rst
@@ -21,8 +21,7 @@ Start a Ballista Cluster
 .. toctree::
    :maxdepth: 2
 
-   cargo-install
-   docker
-   docker-compose
-   kubernetes
-   configuration
+   Cargo <cargo-install>
+   Docker <docker>
+   Docker Compose <docker-compose>
+   Kubernetes <kubernetes>
diff --git a/docs/source/user-guide/flightsql.md b/docs/source/user-guide/flightsql.md
index cd7b9fce..57410642 100644
--- a/docs/source/user-guide/flightsql.md
+++ b/docs/source/user-guide/flightsql.md
@@ -117,7 +117,7 @@ To register a table, find a `.csv`, `.json`, or `.parquet` file for testing, and
 
 ```sql
 create external table customer stored as CSV with header row
-    location '/home/username/arrow-datafusion/datafusion/core/tests/tpch-csv/customer.csv';
+    location '/path/to/customer.csv';
 ```
 
 Once the table has been registered, all the normal SQL queries can be performed:
diff --git a/docs/source/user-guide/images/ballista-web-ui.png b/docs/source/user-guide/images/ballista-web-ui.png
new file mode 100644
index 00000000..9784d144
Binary files /dev/null and b/docs/source/user-guide/images/ballista-web-ui.png differ
diff --git a/docs/source/user-guide/images/example-query-plan.png b/docs/source/user-guide/images/example-query-plan.png
new file mode 100644
index 00000000..6ead4e62
Binary files /dev/null and b/docs/source/user-guide/images/example-query-plan.png differ
diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
index db7d0e13..65cbe2f7 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -19,16 +19,19 @@
 
 # Overview
 
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is
-built on an architecture that allows other programming languages to be supported as first-class citizens without paying
-a penalty for serialization costs.
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow.
 
-The foundational technologies in Ballista are:
+Ballista has a scheduler and an executor process that are standard Rust executables and can be executed directly, but
+Dockerfiles are provided to build images for use in containerized environments, such as Docker, Docker Compose, and
+Kubernetes. See the [deployment guide](deployment.md) for more information.
 
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
-- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [DataFusion](https://github.com/apache/arrow-datafusion/) for query execution.
+SQL and DataFrame queries can be submitted from Python and Rust, and SQL queries can be submitted via the Arrow
+Flight SQL JDBC driver, supporting your favorite JDBC compliant tools such as [DataGrip](datagrip)
+or [tableau](tableau). For setup instructions, please see the [FlightSQL guide](flightsql.md).
+
+The scheduler has a web user interface for monitoring query status as well as a REST API.
+
+![Ballista Scheduler Web UI](./images/ballista-web-ui.png)
 
 ## How does this compare to Apache Spark?
 
@@ -45,10 +48,6 @@ Although Ballista is largely inspired by Apache Spark, there are some key differ
 - The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
   in any programming language with minimal serialization overhead.
 
-## Status
-
-Ballista is still in the early stages of development but is capable of executing complex analytical queries at scale.
-
-## Usage
-
-Ballista can be used from your favorite JDBC compliant tools such as [DataGrip](https://www.jetbrains.com/datagrip/) or [tableau](https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm). For setup instructions, please see the [FlightSQL guide](flightsql.md).
+[deployment](./deployment)
+[datagrip](https://www.jetbrains.com/datagrip/)
+[tableau](https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm)
diff --git a/docs/source/user-guide/python.md b/docs/source/user-guide/python.md
index 3bd4fe50..4a907566 100644
--- a/docs/source/user-guide/python.md
+++ b/docs/source/user-guide/python.md
@@ -21,6 +21,9 @@
 
 Ballista provides Python bindings, allowing SQL and DataFrame queries to be executed from the Python shell.
 
+Like PySpark, it allows you to build a plan through SQL or a DataFrame API against Parquet, CSV, JSON, and other
+popular file formats, run it in a distributed environment, and obtain the result back in Python.
+
 ## Connecting to a Cluster
 
 The following code demonstrates how to create a Ballista context and connect to a scheduler.
@@ -30,7 +33,13 @@ The following code demonstrates how to create a Ballista context and connect to
 >>> ctx = ballista.BallistaContext("localhost", 50050)
 ```
 
-## Registering Tables
+## SQL
+
+The Python bindings support executing SQL queries as well.
+
+### Registering Tables
+
+Before SQL queries can be executed, tables need to be registered with the context.
 
 Tables can be registered against the context by calling one of the `register` methods, or by executing SQL.
 
@@ -42,7 +51,7 @@ Tables can be registered against the context by calling one of the `register` me
 >>> ctx.sql("CREATE EXTERNAL TABLE trips STORED AS PARQUET LOCATION '/mnt/bigdata/nyctaxi'")
 ```
 
-## Executing Queries
+### Executing Queries
 
 The `sql` method creates a `DataFrame`. The query is executed when an action such as `show` or `collect` is executed.
 
@@ -88,3 +97,37 @@ The `explain` method can be used to show the logical and physical query plans fo
 |               |                                                             |
 +---------------+-------------------------------------------------------------+
 ```
+
+## DataFrame
+
+The following example demonstrates creating arrays with PyArrow and then creating a Ballista DataFrame.
+
+```python
+import ballista
+import pyarrow
+
+# an alias
+f = ballista.functions
+
+# create a context
+ctx = ballista.BallistaContext("localhost", 50050)
+
+# create a RecordBatch and a new DataFrame from it
+batch = pyarrow.RecordBatch.from_arrays(
+    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
+    names=["a", "b"],
+)
+df = ctx.create_dataframe([[batch]])
+
+# create a new statement
+df = df.select(
+    f.col("a") + f.col("b"),
+    f.col("a") - f.col("b"),
+)
+
+# execute and collect the first (and only) batch
+result = df.collect()[0]
+
+assert result.column(0) == pyarrow.array([5, 7, 9])
+assert result.column(1) == pyarrow.array([-3, -3, -3])
+```
diff --git a/docs/source/user-guide/rust.md b/docs/source/user-guide/rust.md
index 8788d058..e3015e76 100644
--- a/docs/source/user-guide/rust.md
+++ b/docs/source/user-guide/rust.md
@@ -19,10 +19,7 @@
 
 # Ballista Rust Client
 
-Ballista usage is very similar to DataFusion. Tha main difference is that the starting point is a `BallistaContext`
-instead of the DataFusion `SessionContext`. Ballista uses the same DataFrame API as DataFusion.
-
-The following code sample demonstrates how to create a `BallistaContext` to connect to a Ballista scheduler process.
+To connect to a Ballista cluster from Rust, first start by creating a `BallistaContext`.
 
 ```rust
 let config = BallistaConfig::builder()
diff --git a/docs/source/user-guide/scheduler.md b/docs/source/user-guide/scheduler.md
new file mode 100644
index 00000000..21d8f876
--- /dev/null
+++ b/docs/source/user-guide/scheduler.md
@@ -0,0 +1,36 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Ballista Scheduler
+
+## Web User Interface
+
+The scheduler provides a web user interface that allows queries to be monitored.
+
+![Ballista Scheduler Web UI](./images/ballista-web-ui.png)
+
+## REST API
+
+The scheduler also provides a REST API that allows jobs to be monitored.
+
+| API                   | Description                                                 |
+| --------------------- | ----------------------------------------------------------- |
+| /api/jobs             | Get a list of jobs that have been submitted to the cluster. |
+| /api/job/{job_id}     | Get a summary of a submitted job.                           |
+| /api/job/{job_id}/dot | Produce a query plan in DOT (graphviz) format.              |
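
The job-monitoring endpoints listed in the new `scheduler.md` table can be called from any HTTP client. The following is a minimal sketch, not part of the commit. It assumes the scheduler serves the REST API at `http://localhost:50050` (the address used in the tuning guide change further down) and that the endpoints return JSON, which the documentation does not state. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed default; adjust to wherever the scheduler's REST API is reachable.
DEFAULT_BASE = "http://localhost:50050"


def jobs_url(base: str = DEFAULT_BASE) -> str:
    """URL of the endpoint listing jobs submitted to the cluster."""
    return f"{base.rstrip('/')}/api/jobs"


def job_url(job_id: str, base: str = DEFAULT_BASE) -> str:
    """URL of the endpoint summarizing a single submitted job."""
    return f"{base.rstrip('/')}/api/job/{quote(job_id, safe='')}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body, assuming (unverified) a JSON response."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running scheduler; prints whatever the endpoint returns.
    print(fetch_json(jobs_url()))
```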
diff --git a/docs/source/user-guide/tuning-guide.md b/docs/source/user-guide/tuning-guide.md
index 3d270051..6b850e9b 100644
--- a/docs/source/user-guide/tuning-guide.md
+++ b/docs/source/user-guide/tuning-guide.md
@@ -69,3 +69,24 @@ Pull-based scheduling works in a similar way to Apache Spark and push-based sche
 
 The scheduling policy can be specified in the `--scheduler_policy` parameter when starting the scheduler and executor
 processes. The default is `pull-based`.
+
+## Viewing Query Plans and Metrics
+
+The scheduler provides a web user interface as well as a REST API for monitoring jobs. See the
+[scheduler documentation](scheduler.md) for more information.
+
+To download a query plan in dot format from the scheduler, submit a request to the following API endpoint
+
+```
+http://localhost:50050/api/job/{job_id}/dot
+```
+
+The resulting file can be converted into an image using `graphviz`:
+
+```bash
+dot -Tpng query.dot > query.png
+```
+
+Here is an example query plan:
+
+![query plan](images/example-query-plan.png)
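
The download-and-convert steps added to the tuning guide can be scripted. The following is a minimal sketch, not part of the commit. It assumes Graphviz's `dot` executable is on `PATH` and uses `dot`'s `-o` option in place of the shell redirect shown in the guide. The function names and the `query.dot` output location are illustrative.

```python
import subprocess
from pathlib import Path
from urllib.request import urlopen


def dot_command(dot_file: Path, png_file: Path) -> list:
    """Build the Graphviz command that renders a DOT file as a PNG image."""
    return ["dot", "-Tpng", str(dot_file), "-o", str(png_file)]


def save_query_plan(job_id: str, out_dir: Path,
                    base: str = "http://localhost:50050") -> Path:
    """Download a job's plan in DOT format and render it to PNG."""
    dot_file = out_dir / "query.dot"
    # Endpoint as documented in the tuning guide: /api/job/{job_id}/dot
    with urlopen(f"{base}/api/job/{job_id}/dot", timeout=10) as response:
        dot_file.write_bytes(response.read())
    png_file = dot_file.with_suffix(".png")
    # check=True raises CalledProcessError if `dot` exits with an error.
    subprocess.run(dot_command(dot_file, png_file), check=True)
    return png_file
```

Calling `save_query_plan(job_id, Path("."))` with a real job id needs a running scheduler and a local Graphviz install.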
