This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 57170d81 Replace README with Ballista version (#4)
57170d81 is described below
commit 57170d818b49506dcc5a4525b595bebf92f37bad
Author: Andy Grove <[email protected]>
AuthorDate: Thu May 19 12:35:26 2022 -0600
Replace README with Ballista version (#4)
---
README.md                      | 97 +++++++++++++++++++-----------------------
ballista/README.md             | 71 -------------------------------
ballista/rust/client/README.md |  2 +-
3 files changed, 44 insertions(+), 126 deletions(-)
diff --git a/README.md b/README.md
index 58a511da..28477d12 100644
--- a/README.md
+++ b/README.md
@@ -17,79 +17,68 @@
under the License.
-->
-# DataFusion
+_Please note that Ballista development is still happening in the
+[DataFusion repository](https://github.com/apache/arrow-datafusion) but we are in the
+process of migrating to this new repository._
-<img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
+# Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion
-DataFusion is an extensible query execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
-in-memory format.
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
+DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
+Java) to be supported as first-class citizens without paying a penalty for serialization costs.
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+The foundational technologies in Ballista are:
-DataFusion also supports distributed query execution via the
-[Ballista](ballista/README.md) crate.
+- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
+- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
+ data transfer between processes.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
+- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
-## Use Cases
+Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
+case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
+redundancy in the case of a scheduler failing.
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+# Getting Started
-## Why DataFusion?
+Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
-- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+## Distributed Scheduler Overview
-## Known Uses
+Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
+distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
-Projects that adapt to or serve as plugins to DataFusion:
+Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
+of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
-- [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
-- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- [datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-- [datafusion-objectstore-hdfs](https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)
-- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
-- [datafusion-objectstore-azure](https://github.com/datafusion-contrib/datafusion-objectstore-azure)
+Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
-Here are some of the projects known to use DataFusion:
+The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
+interface, and streams the shuffle IPC files.
-- [Ballista](ballista) Distributed Compute Platform
-- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
-- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [delta-rs](https://github.com/delta-io/delta-rs)
-- [Flock](https://github.com/flock-lab/flock)
-- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
-- [ROAPI](https://github.com/roapi/roapi)
-- [Tensorbase](https://github.com/tensorbase/tensorbase)
-- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
+# How does this compare to Apache Spark?
-(if you know of another project, please submit a PR to add a link!)
+Ballista implements a similar design to Apache Spark, but there are some key differences.
-## Example Usage
-
-Please see [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) to find how to use DataFusion.
-
-## Roadmap
-
-Please see [Roadmap](docs/source/specification/roadmap.md) for information of where the project is headed.
+- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
+ GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
+ processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
+ largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+ Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+ distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+ in any programming language with minimal serialization overhead.
## Architecture Overview
-There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
-
-- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- (February 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
-## User's guide
+There is no formal document describing Ballista's architecture yet, but the following presentation offers a good overview of its different components and how they interact together.
-Please see [User Guide](https://arrow.apache.org/datafusion/) for more information about DataFusion.
+- (February 2021): Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
## Contribution Guide
diff --git a/ballista/README.md b/ballista/README.md
deleted file mode 100644
index 1fc3bdbf..00000000
--- a/ballista/README.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Ballista: Distributed Compute with Apache Arrow and DataFusion
-
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
-Java) to be supported as first-class citizens without paying a penalty for serialization costs.
-
-The foundational technologies in Ballista are:
-
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
-- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
- data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
-
-Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
-case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
-redundancy in the case of a scheduler failing.
-
-# Getting Started
-
-Refer to the core [Ballista crate README](rust/client/README.md) for the Getting Started guide.
-
-## Distributed Scheduler Overview
-
-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
-
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
-
-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
-
-The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
-interface, and streams the shuffle IPC files.
-
-# How does this compare to Apache Spark?
-
-Ballista implements a similar design to Apache Spark, but there are some key differences.
-
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
- GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
- processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
- largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
- Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
- distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
- in any programming language with minimal serialization overhead.
diff --git a/ballista/rust/client/README.md b/ballista/rust/client/README.md
index ecf364f4..f5fe094d 100644
--- a/ballista/rust/client/README.md
+++ b/ballista/rust/client/README.md
@@ -35,7 +35,7 @@ Ballista can be deployed as a standalone cluster and also supports [Kubernetes](
case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
redundancy in the case of a scheduler failing.
-## Rust Version Compatbility
+## Rust Version Compatibility
This crate is tested with the latest stable version of Rust. We do not currrently test against other, older versions of the Rust compiler.
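
Note on the stage-splitting step: the Distributed Scheduler Overview added to README.md in this commit says that each `RepartitionExec` is replaced with an `UnresolvedShuffleExec` and that its child is wrapped in a `ShuffleWriterExec` and scheduled. The sketch below illustrates that rewrite on a toy plan tree. It is only a minimal sketch: the `Plan` enum, the `split_stages` function, and the `stage_id` numbering are hypothetical names invented for this example and are not Ballista's or DataFusion's actual types or API. Only the operator names are taken from the README text.

```rust
// Toy model of the stage-splitting step. All types here are hypothetical
// stand-ins for illustration, not Ballista's real operators.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    // Leaf operator reading a named table.
    Scan(String),
    // Any operator that keeps the partitioning of its input.
    Filter(Box<Plan>),
    // Stands in for `RepartitionExec`: the partitioning scheme changes here.
    Repartition(Box<Plan>),
    // Stands in for `UnresolvedShuffleExec`: placeholder for a stage's output.
    UnresolvedShuffle { stage_id: usize },
    // Stands in for `ShuffleWriterExec`: the schedulable root of a stage.
    ShuffleWriter { stage_id: usize, input: Box<Plan> },
}

// Walks the plan bottom-up. Every `Repartition` becomes an
// `UnresolvedShuffle`, and its (already rewritten) child is wrapped in a
// `ShuffleWriter` and appended to `stages`. Returns the rewritten plan.
fn split_stages(plan: Plan, stages: &mut Vec<Plan>) -> Plan {
    match plan {
        Plan::Repartition(child) => {
            // Rewrite the child first so inner stages get lower ids.
            let child = split_stages(*child, stages);
            let stage_id = stages.len();
            stages.push(Plan::ShuffleWriter {
                stage_id,
                input: Box::new(child),
            });
            Plan::UnresolvedShuffle { stage_id }
        }
        Plan::Filter(child) => Plan::Filter(Box::new(split_stages(*child, stages))),
        // Leaves and already-rewritten nodes are returned unchanged.
        other => other,
    }
}
```

Calling `split_stages` on `Filter(Repartition(Scan("t")))` with an empty `stages` vector would leave one `ShuffleWriter` stage wrapping the scan and return `Filter(UnresolvedShuffle { stage_id: 0 })`. In the README's description, the scheduler later swaps such placeholders for `ShuffleReaderExec` operators once the shuffle tasks have completed; that step is not modelled here.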