[arrow-ballista] branch master updated: Replace README with Ballista version (#4)

agrove Thu, 19 May 2022 11:35:51 -0700

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git



The following commit(s) were added to refs/heads/master by this push:
     new 57170d81 Replace README with Ballista version (#4)
57170d81 is described below

commit 57170d818b49506dcc5a4525b595bebf92f37bad
Author: Andy Grove <[email protected]>
AuthorDate: Thu May 19 12:35:26 2022 -0600

    Replace README with Ballista version (#4)
---
 README.md                      | 97 +++++++++++++++++++-----------------------
 ballista/README.md             | 71 -------------------------------
 ballista/rust/client/README.md |  2 +-
 3 files changed, 44 insertions(+), 126 deletions(-)

diff --git a/README.md b/README.md
index 58a511da..28477d12 100644
--- a/README.md
+++ b/README.md
@@ -17,79 +17,68 @@
   under the License.
 -->
 
-# DataFusion
+_Please note that Ballista development is still happening in the
+[DataFusion repository](https://github.com/apache/arrow-datafusion) but we are in the
+process of migrating to this new repository._
 
-<img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
+# Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion
 
-DataFusion is an extensible query execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
-in-memory format.
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
+DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
+Java) to be supported as first-class citizens without paying a penalty for serialization costs.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+The foundational technologies in Ballista are:
 
-DataFusion also supports distributed query execution via the
-[Ballista](ballista/README.md) crate.
+- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
+- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
+  data transfer between processes.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
+- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
 
-## Use Cases
+Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
+case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
+redundancy in the case of a scheduler failing.
 
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+# Getting Started
 
-## Why DataFusion?
+Refer to the core [Ballista crate README](ballista/rust/client/README.md) for the Getting Started guide.
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
-- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+## Distributed Scheduler Overview
 
-## Known Uses
+Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
+distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
 
-Projects that adapt to or serve as plugins to DataFusion:
+Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
+of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
 
-- [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
-- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- [datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-- [datafusion-objectstore-hdfs](https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)
-- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
-- [datafusion-objectstore-azure](https://github.com/datafusion-contrib/datafusion-objectstore-azure)
+Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
 
-Here are some of the projects known to use DataFusion:
+The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
+interface, and streams the shuffle IPC files.
 
-- [Ballista](ballista) Distributed Compute Platform
-- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
-- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [delta-rs](https://github.com/delta-io/delta-rs)
-- [Flock](https://github.com/flock-lab/flock)
-- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
-- [ROAPI](https://github.com/roapi/roapi)
-- [Tensorbase](https://github.com/tensorbase/tensorbase)
-- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
+# How does this compare to Apache Spark?
 
-(if you know of another project, please submit a PR to add a link!)
+Ballista implements a similar design to Apache Spark, but there are some key differences.
 
-## Example Usage
-
-Please see [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) to find how to use DataFusion.
-
-## Roadmap
-
-Please see [Roadmap](docs/source/specification/roadmap.md) for information of where the project is headed.
+- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
+  GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
+  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
+  largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+  distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+  in any programming language with minimal serialization overhead.
 
 ## Architecture Overview
 
-There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
-
-- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- (February 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
-## User's guide
+There is no formal document describing Ballista's architecture yet, but the following presentation offers a good overview of its different components and how they interact together.
 
-Please see [User Guide](https://arrow.apache.org/datafusion/) for more information about DataFusion.
+- (February 2021): Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 
 ## Contribution Guide
 
diff --git a/ballista/README.md b/ballista/README.md
deleted file mode 100644
index 1fc3bdbf..00000000
--- a/ballista/README.md
+++ /dev/null
@@ -1,71 +0,0 @@
-<!---
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
--->
-
-# Ballista: Distributed Compute with Apache Arrow and DataFusion
-
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
-Java) to be supported as first-class citizens without paying a penalty for serialization costs.
-
-The foundational technologies in Ballista are:
-
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
-- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
-  data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
-
-Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
-case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
-redundancy in the case of a scheduler failing.
-
-# Getting Started
-
-Refer to the core [Ballista crate README](rust/client/README.md) for the Getting Started guide.
-
-## Distributed Scheduler Overview
-
-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
-distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
-
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
-of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.
-
-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
-and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
-according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.
-
-The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
-interface, and streams the shuffle IPC files.
-
-# How does this compare to Apache Spark?
-
-Ballista implements a similar design to Apache Spark, but there are some key differences.
-
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of
-  GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
-  in any programming language with minimal serialization overhead.
diff --git a/ballista/rust/client/README.md b/ballista/rust/client/README.md
index ecf364f4..f5fe094d 100644
--- a/ballista/rust/client/README.md
+++ b/ballista/rust/client/README.md
@@ -35,7 +35,7 @@ Ballista can be deployed as a standalone cluster and also supports [Kubernetes](
 case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
 redundancy in the case of a scheduler failing.
 
-## Rust Version Compatbility
+## Rust Version Compatibility
 
 This crate is tested with the latest stable version of Rust. We do not currrently test against other, older versions of the Rust compiler.
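
Note on the "Distributed Scheduler Overview" text in the README diff above: the stage-splitting rule it describes (each `RepartitionExec` becomes an `UnresolvedShuffleExec`, and its child is wrapped in a `ShuffleWriterExec`) can be illustrated with a small, self-contained sketch. This is not Ballista's code. The `Plan` enum, `split_stages`, and `plan_query` are hypothetical names invented for this illustration, and wrapping the final stage in a shuffle writer as well is an assumption drawn from the statement that tasks are currently always `ShuffleWriterExec` operators.

```rust
// Hypothetical, simplified plan model. These are NOT Ballista's or
// DataFusion's real types; they only mirror the operator names in the README.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    // Marks a change of partitioning scheme, i.e. a stage boundary.
    Repartition(Box<Plan>),
    // Placeholder that refers to the stage producing the shuffle output.
    UnresolvedShuffle { stage_id: usize },
    // Stage root: runs `input` and writes its repartitioned output.
    ShuffleWriter { stage_id: usize, input: Box<Plan> },
}

/// Rewrites `plan`, cutting a new stage at every `Repartition` node.
/// Each finished stage is pushed onto `stages`; the rewritten remainder
/// of the plan is returned to the caller.
fn split_stages(plan: Plan, stages: &mut Vec<Plan>) -> Plan {
    match plan {
        Plan::Repartition(child) => {
            // Split the child first so that upstream stages get lower ids.
            let child = split_stages(*child, stages);
            let stage_id = stages.len() + 1;
            stages.push(Plan::ShuffleWriter {
                stage_id,
                input: Box::new(child),
            });
            Plan::UnresolvedShuffle { stage_id }
        }
        Plan::Filter(child) => Plan::Filter(Box::new(split_stages(*child, stages))),
        Plan::Aggregate(child) => Plan::Aggregate(Box::new(split_stages(*child, stages))),
        // Leaves and already-split nodes are returned unchanged.
        other => other,
    }
}

/// Splits a whole query into stages. Assumption: the final stage is also
/// wrapped in a shuffle writer, so every stage has the same shape.
fn plan_query(plan: Plan) -> Vec<Plan> {
    let mut stages = Vec::new();
    let root = split_stages(plan, &mut stages);
    let stage_id = stages.len() + 1;
    stages.push(Plan::ShuffleWriter {
        stage_id,
        input: Box::new(root),
    });
    stages
}
```

Calling `plan_query` on an aggregate over a repartitioned scan yields two stages: stage 1 writes the scan output, and stage 2 runs the aggregate over an `UnresolvedShuffle` placeholder that points at stage 1. In the design the README describes, the scheduler later swaps such placeholders for `ShuffleReaderExec` operators once the upstream shuffle tasks have completed; that step is not modelled here.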
