This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git


The following commit(s) were added to refs/heads/main by this push:
     new ca590dbd New contributor guide documentation (architecture guide & code organization) (#977)
ca590dbd is described below

commit ca590dbd79d90741d471ef54ed5d0a410d3a0a90
Author: Andy Grove <[email protected]>
AuthorDate: Sun Feb 11 11:06:10 2024 -0700

    New contributor guide documentation (architecture guide & code organization) (#977)
---
 README.md                                          |  64 ++-----
 ROADMAP.md                                         |  58 ++++++
 docs/source/contributors-guide/architecture.md     | 205 +++++++++++++++++++++
 docs/source/contributors-guide/ballista.drawio.png | Bin 0 -> 73216 bytes
 .../source/contributors-guide/code-organization.md |  55 ++++++
 .../development.md                                 |  11 +-
 docs/source/index.rst                              |  13 +-
 7 files changed, 350 insertions(+), 56 deletions(-)

diff --git a/README.md b/README.md
index 7cfec476..0fe1c97a 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@
 # Ballista: Distributed SQL Query Engine, built on Apache Arrow
 
 Ballista is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow][arrow] and
-[DataFusion][datafusion].
+[Apache Arrow DataFusion][datafusion].
 
 If you are looking for documentation for a released version of Ballista, please refer to the
 [Ballista User Guide][user-guide].
@@ -41,6 +41,20 @@ Ballista implements a similar design to Apache Spark (particularly Spark SQL), b
   executors using the [Flight Protocol][flight], and between clients and schedulers/executors using the
   [Flight SQL Protocol][flight-sql]
 
+## Architecture
+
+A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes
+can be run as native binaries and are also available as Docker Images, which can be easily deployed with
+[Docker Compose](https://arrow.apache.org/ballista/user-guide/deployment/docker-compose.html) or
+[Kubernetes](https://arrow.apache.org/ballista/user-guide/deployment/kubernetes.html).
+
+The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction
+between the executor(s) and the scheduler for fetching tasks and reporting task status.
+
+![Ballista Cluster Diagram](docs/source/contributors-guide/ballista.drawio.png)
+
+See the [architecture guide](docs/source/contributors-guide/architecture.md) for more details.
+
 ## Features
 
 - Supports HDFS as well as cloud object stores. S3 is supported today and GCS and Azure support is planned.
@@ -72,53 +86,7 @@ Ballista supports a wide range of SQL, including CTEs, Joins, and Subqueries and
 Refer to the [DataFusion SQL Reference](https://arrow.apache.org/datafusion/user-guide/sql/index.html) for more
 information on supported SQL.
 
-Ballista is maturing quickly and is now working towards being production ready. See the following roadmap for more details.
-
-## Roadmap
-
-There is an excellent discussion in https://github.com/apache/arrow-ballista/issues/30 about the future of the project,
-and we encourage you to participate and add your feedback there if you are interested in using or contributing to
-Ballista.
-
-The current focus is on the following items:
-
-- Make production ready
-  - Shuffle file cleanup
-    - Periodically ([#185](https://github.com/apache/arrow-ballista/issues/185))
-    - Add gRPC & REST interfaces for clients/UI to actively call the cleanup for a job or the whole system
-  - Fill functional gaps between DataFusion and Ballista
-  - Improve task scheduling and data exchange efficiency
-  - Better error handling
-    - Scheduler restart
-  - Improve monitoring, logging, and metrics
-  - Auto scaling support
-  - Better configuration management
-  - Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support
-    sharding for scalability and more efficient caching.
-- Shuffle improvement
-  - Shuffle memory control ([#320](https://github.com/apache/arrow-ballista/issues/320))
-  - Improve shuffle IO to avoid producing too many files
-  - Support sort-based shuffle
-  - Support range partition
-  - Support broadcast shuffle ([#342](https://github.com/apache/arrow-ballista/issues/342))
-- Scheduler Improvements
-  - All-at-once job task scheduling
-  - Executor deployment grouping based on resource allocation
-- Cloud Support
-  - Support Azure Blob Storage ([#294](https://github.com/apache/arrow-ballista/issues/294))
-  - Support Google Cloud Storage ([#293](https://github.com/apache/arrow-ballista/issues/293))
-- Performance and scalability
-  - Implement Adaptive Query Execution ([#387](https://github.com/apache/arrow-ballista/issues/387))
-  - Implement bubble execution ([#408](https://github.com/apache/arrow-ballista/issues/408))
-  - Improve benchmark results ([#339](https://github.com/apache/arrow-ballista/issues/339))
-- Python Support
-  - Support Python UDFs ([#173](https://github.com/apache/arrow-ballista/issues/173))
-
-## Architecture Overview
-
-There are currently no up-to-date architecture documents available. You can get a general overview of the architecture
-by watching the [Ballista: Distributed Compute with Rust and Apache Arrow][ballista-talk] talk from the New York Open
-Statistical Programming Meetup (Feb 2021).
+Ballista is maturing quickly and is now working towards being production ready. See the [roadmap](ROADMAP.md) for more details.
 
 ## Contribution Guide
 
diff --git a/ROADMAP.md b/ROADMAP.md
new file mode 100644
index 00000000..3bcd34d4
--- /dev/null
+++ b/ROADMAP.md
@@ -0,0 +1,58 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+## Ballista Roadmap
+
+There is an excellent discussion in https://github.com/apache/arrow-ballista/issues/30 about the future of the project,
+and we encourage you to participate and add your feedback there if you are interested in using or contributing to
+Ballista.
+
+The current focus is on the following items:
+
+- Make production ready
+  - Shuffle file cleanup
+    - Periodically ([#185](https://github.com/apache/arrow-ballista/issues/185))
+    - Add gRPC & REST interfaces for clients/UI to actively call the cleanup for a job or the whole system
+  - Fill functional gaps between DataFusion and Ballista
+  - Improve task scheduling and data exchange efficiency
+  - Better error handling
+    - Scheduler restart
+  - Improve monitoring, logging, and metrics
+  - Auto scaling support
+  - Better configuration management
+  - Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support
+    sharding for scalability and more efficient caching.
+- Shuffle improvement
+  - Shuffle memory control ([#320](https://github.com/apache/arrow-ballista/issues/320))
+  - Improve shuffle IO to avoid producing too many files
+  - Support sort-based shuffle
+  - Support range partition
+  - Support broadcast shuffle ([#342](https://github.com/apache/arrow-ballista/issues/342))
+- Scheduler Improvements
+  - All-at-once job task scheduling
+  - Executor deployment grouping based on resource allocation
+- Cloud Support
+  - Support Azure Blob Storage ([#294](https://github.com/apache/arrow-ballista/issues/294))
+  - Support Google Cloud Storage ([#293](https://github.com/apache/arrow-ballista/issues/293))
+- Performance and scalability
+  - Implement Adaptive Query Execution ([#387](https://github.com/apache/arrow-ballista/issues/387))
+  - Implement bubble execution ([#408](https://github.com/apache/arrow-ballista/issues/408))
+  - Improve benchmark results ([#339](https://github.com/apache/arrow-ballista/issues/339))
+- Python Support
+  - Support Python UDFs ([#173](https://github.com/apache/arrow-ballista/issues/173))
diff --git a/docs/source/contributors-guide/architecture.md b/docs/source/contributors-guide/architecture.md
new file mode 100644
index 00000000..2867fa45
--- /dev/null
+++ b/docs/source/contributors-guide/architecture.md
@@ -0,0 +1,205 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Ballista Architecture
+
+## Overview
+
+Ballista’s primary purpose is to provide a distributed SQL query engine implemented in the Rust programming
+language and using the Apache Arrow memory model.
+
+Ballista also provides a DataFrame API (both in Rust and Python), suitable for constructing ETL pipelines and
+analytical queries. The DataFrame API is inspired by Apache Spark and is currently better suited for ETL/SQL work
+than for data science.
+
+## Design Principles
+
+### Arrow-native
+
+Ballista uses the Apache Arrow memory format during query execution, and the Apache Arrow IPC format on disk for
+shuffle files and for exchanging data between executors. Queries can be submitted using the Arrow Flight SQL API
+and the Arrow Flight SQL JDBC Driver.
+
+### Language Agnostic
+
+Although most of the implementation code is written in Rust, the scheduler and executor APIs are based on open
+standards, including protocol buffers, gRPC, Apache Arrow IPC, and Apache Arrow Flight SQL.
+
+This language-agnostic approach will allow Ballista to eventually support UDFs in languages other than Rust,
+including Wasm.
+
+### Extensible
+
+Many Ballista users have their own distributed query engines that use Ballista as a foundation, rather than
+using Ballista directly. This allows the scheduler and executor processes to be extended with support for
+additional data formats, operators, expressions, custom SQL dialects, or other DSLs.
+
+Ballista uses the DataFusion query engine for query execution, but it should be possible to plug in other execution
+engines.
+
+## Deployment Architecture
+
+### Cluster
+
+A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes
+can be run as native binaries and are also available as Docker Images, which can be easily deployed with
+[Docker Compose](https://arrow.apache.org/ballista/user-guide/deployment/docker-compose.html) or
+[Kubernetes](https://arrow.apache.org/ballista/user-guide/deployment/kubernetes.html).
+
+The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction
+between the executor(s) and the scheduler for fetching tasks and reporting task status.
+
+![Ballista Cluster Diagram](ballista.drawio.png)
+
+### Scheduler
+
+The scheduler provides the following interfaces:
+
+- gRPC service for submitting and managing jobs
+- Flight SQL API
+- REST API for monitoring jobs
+- Web user interface for monitoring jobs
+
+Jobs are submitted to the scheduler's gRPC service from a client context, either in the form of a logical query
+plan or a SQL string. The scheduler then creates an execution graph, which contains a physical plan broken down into
+stages (pipelines) that can be scheduled independently. This process is explained in detail in the Distributed
+Query Scheduling section of this guide.
+
+It is possible to have multiple schedulers running with shared state in etcd, so that jobs can continue to run
+even if a scheduler process fails.
+
+### Executor
+
+The executor processes connect to a scheduler and poll for tasks to perform. These tasks are physical plans in
+protocol buffer format. These physical plans are typically executed against multiple partitions of input data. Executors
+can execute multiple partitions of the same plan in parallel.
+
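The executor's poll/execute/report cycle can be sketched in plain Rust. This is an illustrative sketch only, not Ballista's actual executor code: the `Task` struct and the channel standing in for the scheduler's gRPC task queue are hypothetical, and real executors execute an Arrow physical plan rather than counting tasks.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a task. In Ballista, a task carries a physical
// plan in protocol buffer form plus the input partition to execute it against.
struct Task {
    stage_id: u32,
    partition: usize,
}

// Executor loop: receive a task, execute it, report completion, and repeat
// until the scheduler has no more tasks (the channel closes).
fn run_executor(rx: mpsc::Receiver<Task>) -> usize {
    let mut completed = 0;
    while let Ok(task) = rx.recv() {
        // A real executor would run the plan here and write shuffle output.
        let _ = (task.stage_id, task.partition);
        completed += 1;
    }
    completed
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let executor = thread::spawn(move || run_executor(rx));
    // Stand-in for the scheduler assigning one task per input partition.
    for partition in 0..3 {
        tx.send(Task { stage_id: 3, partition }).unwrap();
    }
    drop(tx); // no more tasks
    assert_eq!(executor.join().unwrap(), 3);
}
```

Because the same plan runs once per partition, an executor with spare task slots can pull several of these tasks and run them concurrently.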
+### Clients
+
+There are multiple clients available for submitting jobs to a Ballista cluster:
+
+- The [Ballista CLI](https://github.com/apache/arrow-ballista/tree/main/ballista-cli) provides a SQL command-line
+  interface.
+- The Python bindings ([PyBallista](https://github.com/apache/arrow-ballista/tree/main/python)) provide a session
+  context with support for SQL and DataFrame operations.
+- The [ballista crate](https://crates.io/crates/ballista) provides a native Rust session context with support for
+  SQL and DataFrame operations.
+- The [Flight SQL JDBC driver](https://arrow.apache.org/docs/java/flight_sql_jdbc_driver.html) can be used from
+  popular SQL tools to execute SQL queries against a cluster.
+
+## Distributed Query Scheduling
+
+Distributed query plans are fundamentally different from in-process query plans because we can't just build a
+tree of operators and start executing them. The query now requires coordination across executors, which means that
+we need a scheduler.
+
+At a high level, the concept of a distributed query scheduler is not complex. The scheduler needs to examine the
+whole query and break it down into stages that can be executed in isolation (usually in parallel across the executors)
+and then schedule these stages for execution based on the available resources in the cluster. Once each query
+stage completes, any dependent query stages can be scheduled. This process repeats until all query
+stages have been executed.
+
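The loop described above can be sketched in plain Rust. This is a simplified illustration, not Ballista's scheduler code: the numeric stage ids and the dependency map are hypothetical, and a real scheduler also weighs available executor resources when launching each wave.

```rust
use std::collections::{HashMap, HashSet};

// Group stages into waves that can run in parallel: a stage becomes runnable
// once every stage it depends on has completed.
fn schedule_stages(deps: &HashMap<u32, Vec<u32>>) -> Vec<Vec<u32>> {
    let mut completed: HashSet<u32> = HashSet::new();
    let mut waves = Vec::new();
    while completed.len() < deps.len() {
        let mut wave: Vec<u32> = deps
            .iter()
            .filter(|&(id, d)| !completed.contains(id) && d.iter().all(|p| completed.contains(p)))
            .map(|(id, _)| *id)
            .collect();
        assert!(!wave.is_empty(), "cycle in stage graph");
        wave.sort();
        completed.extend(&wave);
        waves.push(wave);
    }
    waves
}

fn main() {
    // Hypothetical stage graph: stages 1 and 2 scan and repartition their
    // inputs, stage 3 joins them, and stage 4 performs the final aggregate.
    let deps = HashMap::from([(1, vec![]), (2, vec![]), (3, vec![1, 2]), (4, vec![3])]);
    println!("{:?}", schedule_stages(&deps)); // [[1, 2], [3], [4]]
}
```

The output shows the key property: stages with no unmet dependencies (1 and 2) run together, and each later stage waits only for its own inputs.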
+### Producing a Distributed Query Plan
+
+Some operators can run in parallel on input partitions and some operators require data to be repartitioned. These
+changes in partitioning are key to planning a distributed query. Changes in partitioning within a plan are sometimes
+called pipeline breakers, and these changes in partitioning define the boundaries between query stages.
+
+We will now use the following SQL query to see how this process works.
+
+```sql
+SELECT customer.id, sum(order.amount) as total_amount
+FROM customer JOIN order ON customer.id = order.customer_id
+GROUP BY customer.id
+```
+
+The physical (non-distributed) plan for this query would look something like this:
+
+```
+Projection: #customer.id, #total_amount
+  HashAggregate: groupBy=[customer.id], aggr=[SUM(order.amount) AS total_amount]
+    Join: condition=[customer.id = order.customer_id]
+      Scan: customer
+      Scan: order
+```
+
+Assuming that the customer and order tables are not already partitioned on customer id, we will need to schedule
+execution of the first two query stages to repartition this data. These two query stages can run in parallel.
+
+```
+Query Stage #1: repartition=[customer.id]
+  Scan: customer
+Query Stage #2: repartition=[order.customer_id]
+  Scan: order
+```
+
+Next, we can schedule the join, which will run in parallel for each partition of the two inputs. The next operator
+after the join is the aggregate, which is split into two parts: the aggregate that runs in parallel, and then
+the final aggregate that requires a single input partition. We can perform the parallel part of this aggregate
+in the same query stage as the join because this first aggregate does not care how the data is partitioned. This
+gives us our third query stage, which can now be scheduled for execution. The output of this query stage
+remains partitioned by customer id.
+
+```
+Query Stage #3: repartition=[]
+  HashAggregate: groupBy=[customer.id], aggr=[SUM(order.amount) AS total_amount]
+    Join: condition=[customer.id = order.customer_id]
+      Query Stage #1
+      Query Stage #2
+```
+
+The final query stage performs the aggregate of the aggregates, reading from all of the partitions from the previous
+stage.
+
+```
+Query Stage #4:
+  Projection: #customer.id, #total_amount
+    HashAggregate: groupBy=[customer.id], aggr=[SUM(order.amount) AS total_amount]
+      QueryStage #3
+```
+
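The two-phase split of the aggregate can be illustrated with a small sketch in plain Rust. This is not Ballista's actual aggregate operator: the `(customer_id, amount)` row layout and function names are hypothetical, but the shape matches the plan above, where each partition computes partial sums independently and the final stage merges them.

```rust
use std::collections::HashMap;

// Partial aggregate: each partition sums order amounts per customer
// independently, regardless of how the rows are partitioned.
fn partial_sum(rows: &[(u32, f64)]) -> HashMap<u32, f64> {
    let mut acc = HashMap::new();
    for (customer_id, amount) in rows {
        *acc.entry(*customer_id).or_insert(0.0) += amount;
    }
    acc
}

// Final aggregate: merge the per-partition partial sums into one result.
fn final_sum(partials: Vec<HashMap<u32, f64>>) -> HashMap<u32, f64> {
    let mut acc = HashMap::new();
    for partial in partials {
        for (customer_id, sum) in partial {
            *acc.entry(customer_id).or_insert(0.0) += sum;
        }
    }
    acc
}

fn main() {
    // Customer 1 appears in both partitions; the final stage combines them.
    let partition_a = partial_sum(&[(1, 10.0), (2, 5.0)]);
    let partition_b = partial_sum(&[(1, 2.5)]);
    let totals = final_sum(vec![partition_a, partition_b]);
    assert_eq!(totals[&1], 12.5);
    assert_eq!(totals[&2], 5.0);
}
```

Because addition is associative, the partial sums can be computed anywhere; only the small map of per-customer partials has to flow to the final stage.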
+To recap, here is the full distributed query plan showing the query stages that are introduced when data needs to be
+repartitioned or exchanged between pipelined operations.
+
+```
+Query Stage #4:
+  Projection: #customer.id, #total_amount
+    HashAggregate: groupBy=[customer.id], aggr=[SUM(order.amount) AS total_amount]
+      Query Stage #3: repartition=[]
+        HashAggregate: groupBy=[customer.id], aggr=[SUM(order.amount) AS total_amount]
+          Join: condition=[customer.id = order.customer_id]
+            Query Stage #1: repartition=[customer.id]
+              Scan: customer
+            Query Stage #2: repartition=[order.customer_id]
+              Scan: order
+```
+
+### Shuffle
+
+Each stage of the execution graph has the same partitioning scheme for all of the operators in the plan. However,
+the output of each stage typically needs to be repartitioned before it can be used as the input to the next stage. An
+example of this is when a query contains multiple joins. Data needs to be partitioned by the join keys before the join
+can be performed.
+
+Each executor will repartition the output of the stage it is running so that it can be consumed by the next
+stage. This mechanism is known as an Exchange or a Shuffle. The logic for this can be found in the [ShuffleWriterExec]
+and [ShuffleReaderExec] operators.
+
+[shufflewriterexec]: https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/execution_plans/shuffle_writer.rs
+[shufflereaderexec]: https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/execution_plans/shuffle_reader.rs
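The core idea behind the shuffle writer's repartitioning step can be sketched in a few lines of plain Rust. This is illustrative only: the `(key, value)` row type and hasher choice are hypothetical, and ShuffleWriterExec writes each output partition to Arrow IPC files on disk rather than to in-memory vectors.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route each (key, value) row to an output partition by hashing the key, so
// that all rows with the same key are consumed by the same downstream task.
fn repartition(rows: Vec<(u32, f64)>, num_partitions: usize) -> Vec<Vec<(u32, f64)>> {
    let mut partitions = vec![Vec::new(); num_partitions];
    for row in rows {
        let mut hasher = DefaultHasher::new();
        row.0.hash(&mut hasher);
        partitions[(hasher.finish() as usize) % num_partitions].push(row);
    }
    partitions
}

fn main() {
    let rows = vec![(1, 10.0), (2, 5.0), (1, 2.5)];
    let partitions = repartition(rows, 4);
    // Both rows for key 1 land in the same output partition, which is what
    // lets a downstream join or aggregate stage process each key locally.
    let holding_key_1 = partitions.iter().filter(|p| p.iter().any(|r| r.0 == 1)).count();
    assert_eq!(holding_key_1, 1);
}
```

The corresponding shuffle reader on the next stage simply fetches partition `i` from every upstream task and concatenates the results.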
diff --git a/docs/source/contributors-guide/ballista.drawio.png b/docs/source/contributors-guide/ballista.drawio.png
new file mode 100644
index 00000000..7d924a14
Binary files /dev/null and b/docs/source/contributors-guide/ballista.drawio.png differ
diff --git a/docs/source/contributors-guide/code-organization.md b/docs/source/contributors-guide/code-organization.md
new file mode 100644
index 00000000..6b830589
--- /dev/null
+++ b/docs/source/contributors-guide/code-organization.md
@@ -0,0 +1,55 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+## Ballista Code Organization
+
+This section provides links to the source code for major areas of functionality.
+
+### ballista-core crate
+
+- [Crate Source](https://github.com/apache/arrow-ballista/blob/main/ballista/core)
+- [Protocol Buffer Definition](https://github.com/apache/arrow-ballista/blob/main/ballista/core/proto/ballista.proto)
+- [Execution Plans](https://github.com/apache/arrow-ballista/tree/main/ballista/core/src/execution_plans)
+- [Ballista Client](https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/client.rs)
+
+### ballista-scheduler crate
+
+- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler)
+- [Distributed Query Planner](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/planner.rs)
+- [gRPC Service](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/scheduler_server/grpc.rs)
+- [Flight SQL Service](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/flight_sql.rs)
+- [REST API](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler/src/api)
+- [Web UI](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler/ui)
+- [Prometheus Integration](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/metrics/prometheus.rs)
+
+### ballista-executor crate
+
+- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/executor)
+- [Flight Service](https://github.com/apache/arrow-ballista/blob/main/ballista/executor/src/flight_service.rs)
+- [Executor Server](https://github.com/apache/arrow-ballista/blob/main/ballista/executor/src/executor_server.rs)
+
+### ballista crate
+
+- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/client)
+- [Context](https://github.com/apache/arrow-ballista/blob/main/ballista/client/src/context.rs)
+
+### PyBallista
+
+- [Source](https://github.com/apache/arrow-ballista/tree/main/python)
+- [Context](https://github.com/apache/arrow-ballista/blob/main/python/src/context.rs)
diff --git a/docs/source/community/development.md b/docs/source/contributors-guide/development.md
similarity index 79%
rename from docs/source/community/development.md
rename to docs/source/contributors-guide/development.md
index 21527f3a..5d9ca5b7 100644
--- a/docs/source/community/development.md
+++ b/docs/source/contributors-guide/development.md
@@ -28,9 +28,14 @@ conduct](https://www.apache.org/foundation/policies/conduct.html).
 
 ## Development Environment
 
-The easiest way to get started if you are using VSCode or IntelliJ IDEA is to open the provided [Dev Container](https://containers.dev/overview) which will install all the required dependencies including Rust, Docker, Node.js and Yarn. A Dev Container is a development environment that runs in a Docker container. It is configured with all the required dependencies to build and test the project. It also includes VS Code and the Rust and Node.js extensions. Other supporting tools that use D [...]
-
-To use the Dev Container, open the project in VS Code and then click the "Reopen in Container" button in the bottom right corner of the IDE.
+The easiest way to get started if you are using VSCode or IntelliJ IDEA is to open the provided [Dev Container](https://containers.dev/overview)
+which will install all the required dependencies including Rust, Docker, Node.js and Yarn. A Dev Container is a
+development environment that runs in a Docker container. It is configured with all the required dependencies to
+build and test the project. It also includes VS Code and the Rust and Node.js extensions. Other supporting tools
+that use Dev Containers can be seen [here](https://containers.dev/supporting).
+
+To use the Dev Container, open the project in VS Code and then click the "Reopen in Container" button in the
+bottom right corner of the IDE.
 
 If you are not using the Dev Container or VS Code, you will need to install these dependencies yourself.
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 994a9812..60e810b3 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -53,15 +53,19 @@ Table of content
 
    user-guide/configs
    user-guide/tuning-guide
+   user-guide/metrics
    user-guide/faq
 
-.. _toc.source:
+.. _toc.contributors:
 
 .. toctree::
    :maxdepth: 1
-   :caption: Source Code
+   :caption: Contributors Guide
 
-   Ballista <https://github.com/apache/arrow-ballista/>
+   contributors-guide/architecture
+   contributors-guide/code-organization
+   contributors-guide/development
+   Source code <https://github.com/apache/arrow-ballista/>
 
 .. _toc.community:
 
@@ -70,7 +74,6 @@ Table of content
    :caption: Community
 
    community/communication
-   community/development
-   
+
    Issue tracker <https://github.com/apache/arrow-ballista/issues>
    Code of conduct <https://github.com/apache/arrow-ballista/blob/main/CODE_OF_CONDUCT.md>
