This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/master by this push:
new ff243a4 Add Roadmap to Documentation (#1104)
ff243a4 is described below
commit ff243a40e84a0bf86b69c976ff0ed317fae6df64
Author: Andrew Lamb <[email protected]>
AuthorDate: Tue Oct 19 12:25:26 2021 -0400
Add Roadmap to Documentation (#1104)
* Add Roadmap
* Fix English, add comments from xudong963
* Add datafusion-cli thoughts
* add more links
* Apply suggestions from code review
Co-authored-by: Loïc Sharma <[email protected]>
Co-authored-by: QP Hou <[email protected]>
* Incorporate comments from QP Hou
* prettier
* Update docs/source/specification/roadmap.md
Co-authored-by: Daniël Heres <[email protected]>
* Apply suggestions from code review
Co-authored-by: Carlos <[email protected]>
Co-authored-by: rdettai <[email protected]>
Co-authored-by: Loïc Sharma <[email protected]>
Co-authored-by: QP Hou <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: Carlos <[email protected]>
Co-authored-by: rdettai <[email protected]>
---
README.md | 4 ++
docs/source/index.rst | 1 +
docs/source/specification/roadmap.md | 99 ++++++++++++++++++++++++++++++++++++
3 files changed, 104 insertions(+)
diff --git a/README.md b/README.md
index 458f197..e1f96f0 100644
--- a/README.md
+++ b/README.md
@@ -356,6 +356,10 @@ are mapped to Arrow types according to the following table
| `CUSTOM` | _Not yet supported_ |
| `ARRAY` | _Not yet supported_ |
+# Roadmap
+
+Please see the [Roadmap](docs/source/specification/roadmap.md) for information on where the project is headed.
+
# Architecture Overview
There is no formal document describing DataFusion's architecture yet, but the
following presentations offer a good overview of its different components and
how they interact together.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6956d0b..bf6b250 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -52,6 +52,7 @@ Table of content
:maxdepth: 1
:caption: Specification
+ specification/roadmap
specification/invariants
specification/output-field-name-semantic
diff --git a/docs/source/specification/roadmap.md b/docs/source/specification/roadmap.md
new file mode 100644
index 0000000..520815b
--- /dev/null
+++ b/docs/source/specification/roadmap.md
@@ -0,0 +1,99 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Roadmap
+
+This document describes the high-level goals of the DataFusion and
+Ballista development community. It is not meant to restrict
+possibilities, but rather help newcomers understand the broader
+context of where the community is headed, and inspire
+additional contributions.
+
+DataFusion and Ballista are part of the [Apache
+Arrow](https://arrow.apache.org/) project and governed by the Apache
+Software Foundation governance model. These projects are entirely
+driven by volunteers, and we welcome contributions for items not on
+this roadmap. However, before submitting a large PR, we strongly
+suggest you start a conversation using a GitHub issue or the
[email protected] mailing list to make review efficient and avoid
+surprises.
+
+# DataFusion
+
+DataFusion's goal is to become the embedded query engine of choice
+for new analytic applications, by leveraging the unique features of
+[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
+to provide:
+
+1. Best-in-class single-node query performance
+2. A declarative SQL query interface compatible with PostgreSQL
+3. A DataFrame API, similar to those offered by Pandas and Spark
+4. A procedural API for programmatically creating and running execution plans
+5. High-performance, data-race-free, ergonomic extensibility points at every layer
+
+## Additional SQL Language Features
+
+- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/master/README.md#status)
+- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194)
+- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533)
+- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
+- Remaining Set Operators (`INTERSECT` / `EXCEPT`) [#1082](https://github.com/apache/arrow-datafusion/issues/1082)
+- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)
+
+## Query Optimizer
+
+- Additional constant folding / partial evaluation [#1070](https://github.com/apache/arrow-datafusion/issues/1070)
+- More sophisticated cost based optimizer for join ordering
+- Implement advanced query optimization framework (Tokomak) #440
+
+## Datasources
+
+- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
+- Support partitioned datasources [#1139](https://github.com/apache/arrow-datafusion/issues/1139) and simplify the integration of other table formats (Delta, Iceberg...)
+- Improve the performance of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)
+
+## Runtime / Infrastructure
+
+- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
+- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
+- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)
+
+## Resource Management
+
+- Finer-grained control of, and limits on, runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64)
+
+## Python Interface
+
+TBD
+
+## DataFusion CLI (`datafusion-cli`)
+
+Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
+
+- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
+- Connect to the `Statistics` subsystem and have the CLI print out more stats for query debugging, etc.
+- Improved error handling for interactive use and shell scripting usage
+- Publish to apt, brew, and possibly the NuGet registry so that people can use it more easily
+- Adopt a shorter name, like dfcli?
+
+## Ballista
+
+# Vision
+
+TBD