This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new fc240966 README updates (#433)
fc240966 is described below
commit fc2409667b07037aa181d2cba62f8b52e4f023c1
Author: Andy Grove <[email protected]>
AuthorDate: Sun Oct 23 17:05:50 2022 -0600
README updates (#433)
---
README.md | 39 ++++++++++++++++++++------
docs/developer/images/ballista-benchmarks.png | Bin 0 -> 24749 bytes
2 files changed, 30 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index 06d0c5f4..bcd7c734 100644
--- a/README.md
+++ b/README.md
@@ -50,6 +50,16 @@ Ballista implements a similar design to Apache Spark (particularly Spark SQL), b
- Scheduler web interface and REST UI for monitoring query progress and viewing query plans and metrics.
- Support for Docker, Docker Compose, and Kubernetes deployment, as well as manual deployment on bare metal.
+## Performance
+
+We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
+These are benchmarks derived from TPC-H and not official TPC-H benchmarks. These results are from running individual
+queries at scale factor 10 (10 GB) on a single node with a single executor and 24 concurrent tasks.
+
+The tracking issue for improving these results is [#339](https://github.com/apache/arrow-ballista/issues/339).
+
+
+
# Getting Started
The easiest way to get started is to run one of the standalone or distributed [examples](./examples/README.md). After
@@ -74,25 +84,35 @@ The current focus is on the following items:
- Make production ready
- Shuffle file cleanup
- - Periodically
+ - Periodically ([#185](https://github.com/apache/arrow-ballista/issues/185))
- Add gRPC & REST interfaces for clients/UI to actively call the cleanup for a job or the whole system
- Fill functional gaps between DataFusion and Ballista
- Improve task scheduling and data exchange efficiency
- Better error handling
- - Schedule restart
+ - Scheduler restart
- Improve monitoring, logging, and metrics
- Auto scaling support
- Better configuration management
-- All-at-once job task scheduling
+ - Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support
+ sharding for scalability and more efficient caching.
- Shuffle improvement
- - Shuffle memory control
+ - Shuffle memory control ([#320](https://github.com/apache/arrow-ballista/issues/320))
- Improve shuffle IO to avoid producing too many files
- Support sort-based shuffle
- Support range partition
- - Support broadcast shuffle
-- Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support
- sharding for scalability and more efficient caching.
-- Executor deployment grouping based on resource allocation
+ - Support broadcast shuffle ([#342](https://github.com/apache/arrow-ballista/issues/342))
+- Scheduler Improvements
+ - All-at-once job task scheduling
+ - Executor deployment grouping based on resource allocation
+- Cloud Support
+ - Support Azure Blob Storage ([#294](https://github.com/apache/arrow-ballista/issues/294))
+ - Support Google Cloud Storage ([#293](https://github.com/apache/arrow-ballista/issues/293))
+- Performance and scalability
+ - Implement Adaptive Query Execution ([#387](https://github.com/apache/arrow-ballista/issues/387))
+ - Implement bubble execution ([#408](https://github.com/apache/arrow-ballista/issues/408))
+ - Improve benchmark results ([#339](https://github.com/apache/arrow-ballista/issues/339))
+- Python Support
+ - Support Python UDFs ([#173](https://github.com/apache/arrow-ballista/issues/173))
## Architecture Overview
@@ -102,10 +122,11 @@ Statistical Programming Meetup (Feb 2021).
## Contribution Guide
-Please see [Contribution Guide](CONTRIBUTING.md) for information about contributing to DataFusion.
+Please see the [Contribution Guide](CONTRIBUTING.md) for information about contributing to Ballista.
[arrow]: https://arrow.apache.org/
[datafusion]: https://github.com/apache/arrow-datafusion
[flight]: https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
[flight-sql]: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
[ballista-talk]: https://www.youtube.com/watch?v=ZZHQaOap9pQ
+[user-guide]: https://arrow.apache.org/ballista/
diff --git a/docs/developer/images/ballista-benchmarks.png b/docs/developer/images/ballista-benchmarks.png
new file mode 100644
index 00000000..28656e2e
Binary files /dev/null and b/docs/developer/images/ballista-benchmarks.png differ