(datafusion-ray) branch main updated: Update Datafusion Ray architecture docs (#27)

agrove Sun, 13 Oct 2024 18:04:58 -0700

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-ray.git



The following commit(s) were added to refs/heads/main by this push:
     new f263522  Update Datafusion Ray architecture docs (#27)
f263522 is described below

commit f2635228027e09424c20629faee07aca2384f9e3
Author: Austin Liu <[email protected]>
AuthorDate: Mon Oct 14 09:03:39 2024 +0800

    Update Datafusion Ray architecture docs (#27)
    
    * Update Datafusion Ray architecture docs
    
    Signed-off-by: Austin Liu <[email protected]>
    
    * Focus on current architecture
    
    Signed-off-by: Austin Liu <[email protected]>
    
    ---------
    
    Signed-off-by: Austin Liu <[email protected]>
---
 docs/README.md | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 516c338..1695521 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,12 +17,12 @@
   under the License.
 -->
 
-# RaySQL Design Documentation
+# DataFusion Ray Design Documentation
 
-RaySQL is a distributed SQL query engine that is powered by DataFusion.
+DataFusion Ray is a distributed SQL query engine that is powered by DataFusion 
and Ray.
 
 DataFusion provides a high-performance query engine that is already 
partition-aware, with partitions being executed
-in parallel in separate threads. RaySQL provides a distributed query planner 
that translates a DataFusion physical
+in parallel in separate threads. DataFusion Ray provides a distributed query 
planner that translates a DataFusion physical
 plan into a distributed plan.
 
 Let's walk through an example to see how that works. We'll use 
[SQLBench-H](https://github.com/sql-benchmarks/sqlbench-h)
@@ -83,9 +83,6 @@ DataFusion's physical plan lists all the files to be queried, 
and they are organ
 parallel execution within a single process. In this example, the level of 
concurrency was configured to be four, so
 we see `partitions={4 groups: [[ ... ]]` in the leaf `ParquetExec` nodes, with 
the filenames listed in four groups.
 
-_DataFusion will soon support parallel execution for single Parquet files but 
for now the parallelism is based on
-splitting the available files into separate groups, so RaySQL will not yet 
scale well for single-file inputs._
-
 Here is the full physical plan for query 3.
 
 ```text
@@ -123,7 +120,7 @@ GlobalLimitExec: skip=0, fetch=10
 ## Partitioning & Distribution
 
 The partitioning scheme changes throughout the plan and this is the most 
important concept to
-understand in order to understand RaySQL's design. Changes in partitioning are 
implemented by the `RepartitionExec`
+understand in order to understand DataFusion Ray's design. Changes in 
partitioning are implemented by the `RepartitionExec`
 operator in DataFusion and are happen in the following scenarios.
 
 ### Joins
@@ -155,7 +152,7 @@ Sort also has multiple approaches.
 - The input partitions can be collapsed down to a single partition and then 
sorted
 - Partitions can be sorted in parallel and then merged using a sort-preserving 
merge
 
-DataFusion and RaySQL currently the first approach, but there is a DataFusion 
PR open for implementing the second.
+DataFusion and DataFusion Ray currently choose the first approach, but there 
is a DataFusion PR open for implementing the second.
 
 ### Limit
 
@@ -260,13 +257,12 @@ child plans, building up a DAG of futures.
 
 ## Distributed Shuffle
 
-The output of each query stage needs to be persisted somewhere so that the 
next query stage can read it. Currently,
-RaySQL is just writing the output to disk in Arrow IPC format, and this means 
that RaySQL is not truly distributed
-yet because it requires a shared file system. It would be better to use the 
Ray object store instead, as
-proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22).
+The output of each query stage needs to be persisted somewhere so that the 
next query stage can read it.
+
+DataFusion Ray uses the Ray object store as a shared file system, which was 
proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22) and 
implemented [here](https://github.com/datafusion-contrib/ray-sql/pull/33).
 
 DataFusion's `RepartitionExec` uses threads and channels within a single 
process and is not suitable for a
-distributed query engine, so RaySQL rewrites the physical plan and replaces 
the `RepartionExec` with a pair of
+distributed query engine, so DataFusion Ray rewrites the physical plan and 
replaces the `RepartionExec` with a pair of
 operators to perform a "shuffle". These are the `ShuffleWriterExec` and 
`ShuffleReaderExec`.
 
 ### Shuffle Writes


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datafusion-ray) branch main updated: Update Datafusion Ray architecture docs (#27)

Reply via email to