This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/master by this push:
new 6313f94 Add Ballista roadmap (#1166)
6313f94 is described below
commit 6313f9423f8eee97ef72b38864e99a2d341bd4f9
Author: Andy Grove <[email protected]>
AuthorDate: Sat Oct 23 05:49:08 2021 -0600
Add Ballista roadmap (#1166)
* Add Ballista roadmap
* prettier
* edits
---
docs/source/specification/roadmap.md | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/docs/source/specification/roadmap.md
b/docs/source/specification/roadmap.md
index 520815b..219fb71 100644
--- a/docs/source/specification/roadmap.md
+++ b/docs/source/specification/roadmap.md
@@ -92,8 +92,28 @@ Note: There are some additional thoughts on a datafusion-cli
vision on [#1096](h
- publishing to apt, brew, and possible NuGet registry so that people can use
it more easily
- adopt a shorter name, like dfcli?
-## Ballista
+# Ballista
-# Vision
+Ballista is a distributed compute platform based on Apache Arrow and
DataFusion. It provides a query scheduler that
+breaks a physical plan into stages and tasks and then schedules tasks for
execution across the available executors
+in the cluster.
-TBD
+Having Ballista as part of the DataFusion codebase helps ensure that
DataFusion remains suitable for distributed
+compute. For example, it helps ensure that physical query plans can be
serialized to protobuf format and that they
+remain language-agnostic so that executors can be built in languages other
than Rust.
+
+## Ballista Roadmap
+
+## Move query scheduler into DataFusion
+
+The Ballista scheduler has some advantages over DataFusion query execution
because it doesn't try to eagerly execute
+the entire query at once but breaks it down into a directionally-acyclic graph
(DAG) of stages and executes a
+configurable number of stages and tasks concurrently. It should be possible to
push some of this logic down to
+DataFusion so that the same scheduler can be used to scale across cores
in-process and across nodes in a cluster.
+
+## Implement execution-time cost-based optimizations based on statistics
+
+After the execution of a query stage, accurate statistics are available for
the resulting data. These statistics
+could be leveraged by the scheduler to optimize the query during execution.
For example, when performing a hash join
+it is desirable to load the smaller side of the join into memory and in some
cases we cannot predict which side will
+be smaller until execution time.