[arrow-datafusion] branch master updated: Add Ballista roadmap (#1166)

alamb Sat, 23 Oct 2021 04:49:16 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git



The following commit(s) were added to refs/heads/master by this push:
     new 6313f94  Add Ballista roadmap (#1166)
6313f94 is described below

commit 6313f9423f8eee97ef72b38864e99a2d341bd4f9
Author: Andy Grove <[email protected]>
AuthorDate: Sat Oct 23 05:49:08 2021 -0600

    Add Ballista roadmap (#1166)
    
    * Add Ballista roadmap
    
    * prettier
    
    * edits
---
 docs/source/specification/roadmap.md | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/docs/source/specification/roadmap.md b/docs/source/specification/roadmap.md
index 520815b..219fb71 100644
--- a/docs/source/specification/roadmap.md
+++ b/docs/source/specification/roadmap.md
@@ -92,8 +92,28 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](h
 - publishing to apt, brew, and possible NuGet registry so that people can use it more easily
 - adopt a shorter name, like dfcli?
 
-## Ballista
+# Ballista
 
-# Vision
+Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
+breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
+in the cluster.
 
-TBD
+Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
+compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
+remain language-agnostic so that executors can be built in languages other than Rust.
+
+## Ballista Roadmap
+
+## Move query scheduler into DataFusion
+
+The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
+the entire query at once but breaks it down into a directed acyclic graph (DAG) of stages and executes a
+configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
+DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
+
+## Implement execution-time cost-based optimizations based on statistics
+
+After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
+could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
+it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
+be smaller until execution time.
