jorgecarleitao commented on a change in pull request #9701:
URL: https://github.com/apache/arrow/pull/9701#discussion_r594419150



##########
File path: rust/datafusion/DEVELOPERS.md
##########
@@ -0,0 +1,98 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Developer's guide
+
+This section describes how you can get started at developing DataFusion.
+
+### Bootstrap environment
+
+DataFusion is written in Rust and it uses a standard rust toolkit:
+
+* `cargo build`
+* `cargo fmt` to format the code
+* `cargo test` to test
+* etc.
+
+### Architecture Overview

Review comment:
       IMO we could keep this section in the README, since it concerns everyone (not just contributors / developers).
   

##########
File path: rust/datafusion/README.md
##########
@@ -19,11 +19,50 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
+<img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/>
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance

Review comment:
       We have no benchmarks against Spark, Dask, pandas, etc., so we need to be careful about this claim. I think @Dandandan was working on running benchmarks against other engines (ARROW-11252) to see how it performs.

##########
File path: rust/datafusion/DEVELOPERS.md
##########
@@ -0,0 +1,79 @@
+# Developer's guide
+
+This section describes how you can get started at developing DataFusion.
+
+### Bootstrap environment
+
+DataFusion is written in Rust and it uses a standard rust toolkit:
+
+* `cargo build`
+* `cargo fmt` to format the code
+* `cargo test` to test
+* etc.
+
+### Architecture Overview
+
+* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)

Review comment:
       I agree that the links are helpful and should be kept.
   
   I do think that, since those slides and presentations did not go through the review / PR process, it may be confusing to say that they describe the architecture.
   
   What do you think of
   
   > There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
   

##########
File path: rust/datafusion/README.md
##########
@@ -19,11 +19,50 @@
 
 # DataFusion
 
-DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
+<img src="docs/images/DataFusion-Logo-Dark.svg" width="256"/>
+
+DataFusion is an extensible query execution framework, written in
+Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format.
+
+DataFusion supports both an SQL and a DataFrame API for building
+logical query plans as well as a query optimizer and execution engine
+capable of parallel execution against partitioned data sources (CSV
+and Parquet) using threads.
+
+## Use Cases
+
+DataFusion is used to create modern, fast and efficient data
+pipelines, ETL processes, and database systems, which need the
+performance of Rust and Apache Arrow and want to provide their users
+the convenience of an SQL interface or a DataFrame API.
+
+## Why DataFusion?
+
+* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves superior performance
+* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+* *High Quality*:  Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+
+## Known Uses
+
+Here are some of the projects known to use DataFusion:
+
+* [Ballista](https://github.com/ballista-compute/ballista) Distributed Compute Platform
+* [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
+* [Cube.js](https://github.com/cube-js/cube.js)
+* [delta-rs](https://github.com/delta-io/delta-rs)
+* [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+* [ROAPI](https://github.com/roapi/roapi)
+
+(if you know of another project, please submit a PR to add a link!)

Review comment:
       [datafusion-python](https://pypi.org/project/datafusion/)? ^_^




