[arrow-datafusion] branch main updated: Minor: port some content to the docs (#5684)

alamb Wed, 22 Mar 2023 03:55:34 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new af97ac886c Minor: port some content to the docs (#5684)
af97ac886c is described below

commit af97ac886c425efefb0536c5344894703f65d7fa
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Mar 22 11:55:17 2023 +0100

    Minor: port some content to the docs (#5684)
    
    * Minor: port some content to the docs
    
    * prettier
---
 docs/source/user-guide/introduction.md | 51 +++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
index 64b6be9d28..5e6859f8d6 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -19,21 +19,52 @@
 
 # Introduction
 
-DataFusion is an extensible query execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
+[Rust](https://www.rust-lang.org/), using the [Apache Arrow](https://arrow.apache.org)
 in-memory format.
 
-DataFusion supports SQL and a DataFrame API for building logical query
-plans, an extensive query optimizer, and a multi-threaded parallel
-execution execution engine for processing partitioned data sources
-such as CSV and Parquet files extremely quickly.
+DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+
+## Features
+
+- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
+- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
+- Native support for Parquet, CSV, JSON, and Avro file formats. Support
+  for custom file formats and non file datasources via the `TableProvider` trait.
+- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
+  other query languages, custom plan and execution nodes, optimizer passes, and more.
+- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
+  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
+  `ObjectStore` trait.
+- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
+  [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
+- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
+  automatic join reordering, expression coercion, and more.
+- Permissive Apache 2.0 License, Apache Software Foundation governance
+- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
+  productivity similar to Java or Golang, the performance of C++, and
+  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
+  with other projects, and to pass plans across language boundaries.
 
 ## Use Cases
 
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+DataFusion can be used without modification as an embedded SQL
+engine or can be customized and used as a foundation for
+building new systems. Here are some examples of systems built using DataFusion:
+
+- Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark-like systems such as [Ballista].
+- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
+- Research platform for new Database Systems, such as [Flock]
+- SQL support to another library, such as [dask sql]
+- Streaming data platforms such as [Synnada]
+- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
+- A faster Spark runtime replacement [Blaze]
+
+By using DataFusion, the projects are freed to focus on their specific
+features, and avoid reimplementing general (but still necessary)
+features such as an expression representation, standard optimizations,
+execution plans, file format support, etc.
 
 ## Why DataFusion?
