This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-ray.git
The following commit(s) were added to refs/heads/main by this push:
new f263522 Update Datafusion Ray architecture docs (#27)
f263522 is described below
commit f2635228027e09424c20629faee07aca2384f9e3
Author: Austin Liu <[email protected]>
AuthorDate: Mon Oct 14 09:03:39 2024 +0800
Update Datafusion Ray architecture docs (#27)
* Update Datafusion Ray architecture docs
Signed-off-by: Austin Liu <[email protected]>
* Focus on current architecture
Signed-off-by: Austin Liu <[email protected]>
---------
Signed-off-by: Austin Liu <[email protected]>
---
docs/README.md | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/docs/README.md b/docs/README.md
index 516c338..1695521 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,12 +17,12 @@
under the License.
-->
-# RaySQL Design Documentation
+# DataFusion Ray Design Documentation
-RaySQL is a distributed SQL query engine that is powered by DataFusion.
+DataFusion Ray is a distributed SQL query engine that is powered by DataFusion
and Ray.
DataFusion provides a high-performance query engine that is already
partition-aware, with partitions being executed
-in parallel in separate threads. RaySQL provides a distributed query planner
that translates a DataFusion physical
+in parallel in separate threads. DataFusion Ray provides a distributed query
planner that translates a DataFusion physical
plan into a distributed plan.
Let's walk through an example to see how that works. We'll use
[SQLBench-H](https://github.com/sql-benchmarks/sqlbench-h)
@@ -83,9 +83,6 @@ DataFusion's physical plan lists all the files to be queried,
and they are organ
parallel execution within a single process. In this example, the level of
concurrency was configured to be four, so
we see `partitions={4 groups: [[ ... ]]` in the leaf `ParquetExec` nodes, with
the filenames listed in four groups.
-_DataFusion will soon support parallel execution for single Parquet files but
for now the parallelism is based on
-splitting the available files into separate groups, so RaySQL will not yet
scale well for single-file inputs._
-
Here is the full physical plan for query 3.
```text
@@ -123,7 +120,7 @@ GlobalLimitExec: skip=0, fetch=10
## Partitioning & Distribution
The partitioning scheme changes throughout the plan and this is the most
important concept to
-understand in order to understand RaySQL's design. Changes in partitioning are
implemented by the `RepartitionExec`
+understand in order to understand DataFusion Ray's design. Changes in
partitioning are implemented by the `RepartitionExec`
operator in DataFusion and are happen in the following scenarios.
### Joins
@@ -155,7 +152,7 @@ Sort also has multiple approaches.
- The input partitions can be collapsed down to a single partition and then
sorted
- Partitions can be sorted in parallel and then merged using a sort-preserving
merge
-DataFusion and RaySQL currently the first approach, but there is a DataFusion
PR open for implementing the second.
+DataFusion and DataFusion Ray currently choose the first approach, but there
is a DataFusion PR open for implementing the second.
### Limit
@@ -260,13 +257,12 @@ child plans, building up a DAG of futures.
## Distributed Shuffle
-The output of each query stage needs to be persisted somewhere so that the
next query stage can read it. Currently,
-RaySQL is just writing the output to disk in Arrow IPC format, and this means
that RaySQL is not truly distributed
-yet because it requires a shared file system. It would be better to use the
Ray object store instead, as
-proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22).
+The output of each query stage needs to be persisted somewhere so that the
next query stage can read it.
+
+DataFusion Ray uses the Ray object store as a shared file system, which was
proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22) and
implemented [here](https://github.com/datafusion-contrib/ray-sql/pull/33).
DataFusion's `RepartitionExec` uses threads and channels within a single
process and is not suitable for a
-distributed query engine, so RaySQL rewrites the physical plan and replaces
the `RepartionExec` with a pair of
+distributed query engine, so DataFusion Ray rewrites the physical plan and
replaces the `RepartionExec` with a pair of
operators to perform a "shuffle". These are the `ShuffleWriterExec` and
`ShuffleReaderExec`.
### Shuffle Writes
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]