This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new 86987e7a Make the user guide about Ballista, not DataFusion (#161)
86987e7a is described below
commit 86987e7ae8cbc440a0efc1bbf0da39dbe0537e13
Author: Andy Grove <[email protected]>
AuthorDate: Sat Aug 27 11:58:33 2022 -0600
Make the user guide about Ballista, not DataFusion (#161)
* Remove Raspberry Pi docs
* Delete DataFusion logos
* User guide is now specific to Ballista
* prettier
---
.../images/DataFusion-Logo-Background-White.png | Bin 12401 -> 0 bytes
.../images/DataFusion-Logo-Background-White.svg | 1 -
.../source/_static/images/DataFusion-Logo-Dark.png | Bin 20134 -> 0 bytes
.../source/_static/images/DataFusion-Logo-Dark.svg | 1 -
.../_static/images/DataFusion-Logo-Light.png | Bin 19102 -> 0 bytes
.../_static/images/DataFusion-Logo-Light.svg | 1 -
docs/source/_static/images/ballista-logo.png | Bin 0 -> 7598 bytes
docs/source/cli/index.rst | 128 --------
docs/source/conf.py | 2 +-
docs/source/index.rst | 44 +--
docs/source/python/api.rst | 30 --
docs/source/python/api/dataframe.rst | 27 --
docs/source/python/api/execution_context.rst | 27 --
docs/source/python/api/expression.rst | 27 --
docs/source/python/api/functions.rst | 27 --
docs/source/python/index.rst | 191 ------------
docs/source/specification/invariants.md | 327 ---------------------
.../specification/output-field-name-semantic.md | 212 -------------
docs/source/specification/quarterly_roadmap.md | 90 ------
docs/source/specification/rfcs/template.md | 58 ----
docs/source/specification/roadmap.md | 118 --------
.../{distributed => }/deployment/cargo-install.md | 0
.../{distributed => }/deployment/configuration.md | 0
.../{distributed => }/deployment/docker-compose.md | 0
.../{distributed => }/deployment/docker.md | 0
.../{distributed => }/deployment/index.rst | 1 -
.../{distributed => }/deployment/kubernetes.md | 0
docs/source/user-guide/distributed/clients/cli.rst | 111 -------
.../user-guide/distributed/clients/index.rst | 26 --
.../distributed/deployment/raspberrypi.md | 129 --------
docs/source/user-guide/distributed/index.rst | 26 --
docs/source/user-guide/distributed/introduction.md | 50 ----
docs/source/user-guide/example-usage.md | 79 -----
docs/source/user-guide/introduction.md | 43 +--
docs/source/user-guide/library.md | 112 -------
.../user-guide/{distributed/clients => }/python.md | 12 +-
.../user-guide/{distributed/clients => }/rust.md | 0
docs/source/user-guide/sql/aggregate_functions.md | 62 ----
docs/source/user-guide/sql/datafusion-functions.md | 110 -------
docs/source/user-guide/sql/ddl.md | 99 -------
docs/source/user-guide/sql/index.rst | 28 --
docs/source/user-guide/sql/select.md | 136 ---------
docs/source/user-guide/sql/sql_status.md | 246 ----------------
43 files changed, 48 insertions(+), 2533 deletions(-)
diff --git a/docs/source/_static/images/DataFusion-Logo-Background-White.png b/docs/source/_static/images/DataFusion-Logo-Background-White.png
deleted file mode 100644
index 023c2373..00000000
Binary files a/docs/source/_static/images/DataFusion-Logo-Background-White.png and /dev/null differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Background-White.svg b/docs/source/_static/images/DataFusion-Logo-Background-White.svg
deleted file mode 100644
index b3bb47c5..00000000
--- a/docs/source/_static/images/DataFusion-Logo-Background-White.svg
+++ /dev/null
@@ -1 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 350 170"><rect width="100%" height="105%" fill="white"/><defs><style>.cls-1{fill:none;stroke:#000;stroke-linecap:round;stroke-miterlimit:10;stroke-width:0.75px;stroke-dasharray:0.75 3;}.cls-2{fill:#f3971f;}.cls-3{fill:#f29720;}</style></defs><title>DataFUSION-Logo-Dark</title><g id="Layer_2" data-name="Layer 2" transform="translate(10 10)"><g id="logo"><path class="cls-1" d="M257.26,112.82c16,20.72,25.14,36.57,22,39.34"/><path class="c [...]
\ No newline at end of file
diff --git a/docs/source/_static/images/DataFusion-Logo-Dark.png b/docs/source/_static/images/DataFusion-Logo-Dark.png
deleted file mode 100644
index cc60f12a..00000000
Binary files a/docs/source/_static/images/DataFusion-Logo-Dark.png and /dev/null differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Dark.svg b/docs/source/_static/images/DataFusion-Logo-Dark.svg
deleted file mode 100644
index e16f2444..00000000
--- a/docs/source/_static/images/DataFusion-Logo-Dark.svg
+++ /dev/null
@@ -1 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 329.21 164.93"><defs><style>.cls-1{fill:none;stroke:#000;stroke-linecap:round;stroke-miterlimit:10;stroke-width:0.75px;stroke-dasharray:0.75 3;}.cls-2{fill:#f3971f;}.cls-3{fill:#f29720;}</style></defs><title>DataFUSION-Logo-Dark</title><g id="Layer_2" data-name="Layer 2"><g id="logo"><path class="cls-1" d="M257.26,112.82c16,20.72,25.14,36.57,22,39.34"/><path class="cls-1" d="M184.24,37.13c6.55,4.41,15.83,12.47,26.43,23"/><path class="c [...]
\ No newline at end of file
diff --git a/docs/source/_static/images/DataFusion-Logo-Light.png b/docs/source/_static/images/DataFusion-Logo-Light.png
deleted file mode 100644
index 8992213b..00000000
Binary files a/docs/source/_static/images/DataFusion-Logo-Light.png and /dev/null differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Light.svg b/docs/source/_static/images/DataFusion-Logo-Light.svg
deleted file mode 100644
index b3bef219..00000000
--- a/docs/source/_static/images/DataFusion-Logo-Light.svg
+++ /dev/null
@@ -1 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 329.21 164.93"><defs><style>.cls-1{fill:none;stroke:#fff;stroke-linecap:round;stroke-miterlimit:10;stroke-width:0.75px;stroke-dasharray:0.75 3;}.cls-2{fill:#fff;}.cls-3{fill:#f3971f;}.cls-4{fill:#f29720;}</style></defs><title>DataFUSION-Logo-Light</title><g id="Layer_2" data-name="Layer 2"><g id="logo"><path class="cls-1" d="M257.26,112.82c16,20.72,25.14,36.57,22,39.34"/><path class="cls-1" d="M184.24,37.13c6.55,4.41,15.83,12.47,26.43, [...]
\ No newline at end of file
diff --git a/docs/source/_static/images/ballista-logo.png b/docs/source/_static/images/ballista-logo.png
new file mode 100644
index 00000000..5c1d1ed2
Binary files /dev/null and b/docs/source/_static/images/ballista-logo.png differ
diff --git a/docs/source/cli/index.rst b/docs/source/cli/index.rst
deleted file mode 100644
index c10db36d..00000000
--- a/docs/source/cli/index.rst
+++ /dev/null
@@ -1,128 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-=======================
-DataFusion Command-line
-=======================
-
-The Arrow DataFusion CLI is a command-line interactive SQL utility that allows
-queries to be executed against CSV and Parquet files. It is a convenient way to
-try DataFusion out with your own data sources.
-
-Install and run using Cargo
-===========================
-
-The easiest way to install the DataFusion CLI is via `cargo install datafusion-cli`.
-
-Install and run using Homebrew (on MacOS)
-=========================================
-
-DataFusion CLI can also be installed via Homebrew (on MacOS). Install it as any other pre-built software like this:
-
-.. code-block:: bash
-
- brew install datafusion
- # ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/manifests/5.0.0
- # ######################################################################## 100.0%
- # ==> Downloading https://ghcr.io/v2/homebrew/core/datafusion/blobs/sha256:9ecc8a01be47ceb9a53b39976696afa87c0a8
- # ==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:9ecc8a01be47ceb9a53b39976
- # ######################################################################## 100.0%
- # ==> Pouring datafusion--5.0.0.big_sur.bottle.tar.gz
- # 🍺 /usr/local/Cellar/datafusion/5.0.0: 9 files, 17.4MB
-
- datafusion-cli
-
-
-Run using Docker
-================
-
-There is no officially published Docker image for the DataFusion CLI, so it is necessary to build from source
-instead.
-
-Use the following commands to clone this repository and build a Docker image containing the CLI tool. Note that there is a :code:`.dockerignore` file in the root of the repository that may need to be deleted in order for this to work.
-
-.. code-block:: bash
-
- git clone https://github.com/apache/arrow-datafusion
- git checkout 8.0.0
- cd arrow-datafusion
- docker build -f datafusion-cli/Dockerfile . --tag datafusion-cli
- docker run -it -v $(your_data_location):/data datafusion-cli
-
-
-Usage
-=====
-
-.. code-block:: bash
-
- Apache Arrow <[email protected]>
- Command Line Client for DataFusion query engine and Ballista distributed computation engine.
-
- USAGE:
- datafusion-cli [OPTIONS]
-
- OPTIONS:
- -c, --batch-size <BATCH_SIZE> The batch size of each query, or use DataFusion default
- -f, --file <FILE>... Execute commands from file(s), then exit
- --format <FORMAT> [default: table] [possible values: csv, tsv, table, json, nd-json]
- -h, --help Print help information
- -p, --data-path <DATA_PATH> Path to your data, default to current directory
- -q, --quiet Reduce printing other than the results and work quietly
- -r, --rc <RC>... Run the provided files on startup instead of ~/.datafusionrc
- -V, --version Print version information
-
-Type `exit` or `quit` to exit the CLI.
-
-
-Registering Parquet Data Sources
-================================
-
-Parquet data sources can be registered by executing a :code:`CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.
-
-.. code-block:: sql
-
- CREATE EXTERNAL TABLE taxi
- STORED AS PARQUET
- LOCATION '/mnt/nyctaxi/tripdata.parquet';
-
-
-Registering CSV Data Sources
-============================
-
-CSV data sources can be registered by executing a :code:`CREATE EXTERNAL TABLE` SQL statement. It is necessary to provide schema information for CSV files since DataFusion does not automatically infer the schema when using SQL to query CSV files.
-
-.. code-block:: sql
-
- CREATE EXTERNAL TABLE test (
- c1 VARCHAR NOT NULL,
- c2 INT NOT NULL,
- c3 SMALLINT NOT NULL,
- c4 SMALLINT NOT NULL,
- c5 INT NOT NULL,
- c6 BIGINT NOT NULL,
- c7 SMALLINT NOT NULL,
- c8 INT NOT NULL,
- c9 BIGINT NOT NULL,
- c10 VARCHAR NOT NULL,
- c11 FLOAT NOT NULL,
- c12 DOUBLE NOT NULL,
- c13 VARCHAR NOT NULL
- )
- STORED AS CSV
- WITH HEADER ROW
- LOCATION '/path/to/aggregate_test_100.csv';
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 7075528b..63c3d26a 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -102,7 +102,7 @@ html_context = {
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
-html_logo = "_static/images/DataFusion-Logo-Background-White.png"
+html_logo = "_static/images/ballista-logo.png"
html_css_files = ["theme_overrides.css"]
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 07c93040..c8f0ba98 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -15,22 +15,13 @@
.. specific language governing permissions and limitations
.. under the License.
-=======================
-Apache Arrow DataFusion
-=======================
+=====================
+Apache Arrow Ballista
+=====================
Table of content
================
-.. _toc.usage:
-
-.. toctree::
- :maxdepth: 1
- :caption: Supported Environments
-
- Rust <https://docs.rs/crate/datafusion/>
- Python <python/index>
- Command line <cli/index>
.. _toc.guide:
@@ -39,32 +30,19 @@ Table of content
:caption: User Guide
user-guide/introduction
- user-guide/example-usage
- user-guide/library
+ user-guide/deployment/index
+ user-guide/python
+ user-guide/rust
user-guide/cli
- user-guide/sql/index
- user-guide/distributed/index
user-guide/faq
-.. _toc.specs:
-
-.. toctree::
- :maxdepth: 1
- :caption: Specification
-
- specification/roadmap
- specification/invariants
- specification/output-field-name-semantic
- specification/quarterly_roadmap
-
-.. _toc.readme:
+.. _toc.source:
.. toctree::
:maxdepth: 1
- :caption: README
+ :caption: Source Code
- DataFusion <https://github.com/apache/arrow-datafusion/blob/master/README.md>
- Ballista <https://github.com/apache/arrow-datafusion/tree/master/ballista/README.md>
+ Ballista <https://github.com/apache/arrow-ballista/>
.. _toc.community:
@@ -73,5 +51,5 @@ Table of content
:caption: Community
community/communication
- Issue tracker <https://github.com/apache/arrow-datafusion/issues>
- Code of conduct <https://github.com/apache/arrow-datafusion/blob/master/CODE_OF_CONDUCT.md>
+ Issue tracker <https://github.com/apache/arrow-ballista/issues>
+ Code of conduct <https://github.com/apache/arrow-ballista/blob/master/CODE_OF_CONDUCT.md>
diff --git a/docs/source/python/api.rst b/docs/source/python/api.rst
deleted file mode 100644
index f81753e0..00000000
--- a/docs/source/python/api.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-.. _api:
-
-*************
-API Reference
-*************
-
-.. toctree::
- :maxdepth: 2
-
- api/dataframe
- api/execution_context
- api/expression
- api/functions
diff --git a/docs/source/python/api/dataframe.rst b/docs/source/python/api/dataframe.rst
deleted file mode 100644
index 0a3c4c8b..00000000
--- a/docs/source/python/api/dataframe.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-.. _api.dataframe:
-.. currentmodule:: datafusion
-
-DataFrame
-=========
-
-.. autosummary::
- :toctree: ../generated/
-
- DataFrame
diff --git a/docs/source/python/api/execution_context.rst b/docs/source/python/api/execution_context.rst
deleted file mode 100644
index 5b7e0f82..00000000
--- a/docs/source/python/api/execution_context.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-.. _api.execution_context:
-.. currentmodule:: datafusion
-
-SessionContext
-================
-
-.. autosummary::
- :toctree: ../generated/
-
- SessionContext
diff --git a/docs/source/python/api/expression.rst b/docs/source/python/api/expression.rst
deleted file mode 100644
index 45923fb5..00000000
--- a/docs/source/python/api/expression.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-.. _api.expression:
-.. currentmodule:: datafusion
-
-Expression
-==========
-
-.. autosummary::
- :toctree: ../generated/
-
- Expression
diff --git a/docs/source/python/api/functions.rst b/docs/source/python/api/functions.rst
deleted file mode 100644
index 6f10d826..00000000
--- a/docs/source/python/api/functions.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-.. _api.functions:
-.. currentmodule:: datafusion
-
-Functions
-=========
-
-.. autosummary::
- :toctree: ../generated/
-
- functions
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
deleted file mode 100644
index 3cafc550..00000000
--- a/docs/source/python/index.rst
+++ /dev/null
@@ -1,191 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-====================
-DataFusion in Python
-====================
-
-This is a Python library that binds to `Apache Arrow <https://arrow.apache.org/>`_ in-memory query engine `DataFusion <https://github.com/apache/arrow/tree/master/rust/datafusion>`_.
-
-Like pyspark, it allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Python.
-
-It also allows you to use UDFs and UDAFs for complex operations.
-
-The major advantage of this library over other execution engines is that this library achieves zero-copy between Python and its execution engine: there is no cost in using UDFs, UDAFs, and collecting the results to Python apart from having to lock the GIL when running those operations.
-
-Its query engine, DataFusion, is written in `Rust <https://www.rust-lang.org>`_, which makes strong assumptions about thread safety and lack of memory leaks.
-
-Technically, zero-copy is achieved via the `c data interface <https://arrow.apache.org/docs/format/CDataInterface.html>`_.
-
-How to use it
-=============
-
-Simple usage:
-
-.. code-block:: python
-
- import datafusion
- from datafusion import functions as f
- from datafusion import col
- import pyarrow
-
- # create a context
- ctx = datafusion.SessionContext()
-
- # create a RecordBatch and a new DataFrame from it
- batch = pyarrow.RecordBatch.from_arrays(
- [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
- names=["a", "b"],
- )
- df = ctx.create_dataframe([[batch]])
-
- # create a new statement
- df = df.select(
- col("a") + col("b"),
- col("a") - col("b"),
- )
-
- # execute and collect the first (and only) batch
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([5, 7, 9])
- assert result.column(1) == pyarrow.array([-3, -3, -3])
-
-
-UDFs
-----
-
-.. code-block:: python
-
- def is_null(array: pyarrow.Array) -> pyarrow.Array:
- return array.is_null()
-
- udf = f.udf(is_null, [pyarrow.int64()], pyarrow.bool_())
-
- df = df.select(udf(col("a")))
-
-
-UDAF
-----
-
-.. code-block:: python
-
- import pyarrow
- import pyarrow.compute
-
-
- class Accumulator:
- """
- Interface of a user-defined accumulation.
- """
- def __init__(self):
- self._sum = pyarrow.scalar(0.0)
-
- def to_scalars(self) -> [pyarrow.Scalar]:
- return [self._sum]
-
- def update(self, values: pyarrow.Array) -> None:
- # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
- self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(values).as_py())
-
- def merge(self, states: pyarrow.Array) -> None:
- # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
- self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(states).as_py())
-
- def evaluate(self) -> pyarrow.Scalar:
- return self._sum
-
-
- df = ...
-
- udaf = f.udaf(Accumulator, pyarrow.float64(), pyarrow.float64(), [pyarrow.float64()])
-
- df = df.aggregate(
- [],
- [udaf(col("a"))]
- )
-
-
-How to install (from pip)
-=========================
-
-.. code-block:: shell
-
- pip install datafusion
-
-
-How to develop
-==============
-
-This assumes that you have rust and cargo installed. We use the workflow recommended by `pyo3 <https://github.com/PyO3/pyo3>`_ and `maturin <https://github.com/PyO3/maturin>`_.
-
-Bootstrap:
-
-.. code-block:: shell
-
- # fetch this repo
- git clone [email protected]:apache/arrow-datafusion.git
-
- cd arrow-datafusion/python
-
- # prepare development environment (used to build wheel / install in development)
- python3 -m venv venv
- # activate the venv
- source venv/bin/activate
- pip install -r requirements.txt
-
-
-Whenever rust code changes (your changes or via `git pull`):
-
-.. code-block:: shell
-
- # make sure you activate the venv using "source venv/bin/activate" first
- maturin develop
- python -m pytest
-
-
-How to update dependencies
-==========================
-
-To change test dependencies, change the `requirements.in` and run
-
-.. code-block:: shell
-
- # install pip-tools (this can be done only once), also consider running in venv
- pip install pip-tools
-
- # change requirements.in and then run
- pip-compile --generate-hashes
-
-
-To update dependencies, run
-
-.. code-block:: shell
-
- pip-compile update
-
-
-More details about pip-tools `here <https://github.com/jazzband/pip-tools>`_
-
-
-API reference
-=============
-
-.. toctree::
- :maxdepth: 2
-
- api
diff --git a/docs/source/specification/invariants.md b/docs/source/specification/invariants.md
deleted file mode 100644
index 17b7c1db..00000000
--- a/docs/source/specification/invariants.md
+++ /dev/null
@@ -1,327 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# DataFusion's Invariants
-
-This document enumerates invariants of DataFusion's logical and physical planes
-(functions, and nodes). Some of these invariants are currently not enforced.
-This document assumes that the reader is familiar with some of the codebase,
-including rust arrow's RecordBatch and Array.
-
-## Rationale
-
-DataFusion's computational model is built on top of a dynamically typed arrow
-object, Array, that offers the interface `Array::as_any` to downcast itself to
-its statically typed versions (e.g. `Int32Array`). DataFusion uses
-`Array::data_type` to perform the respective downcasting on its physical
-operations. DataFusion uses a dynamic type system because the queries being
-executed are not always known at compile time: they are only known during the
-runtime (or query time) of programs built with DataFusion. This document is
-built on top of this principle.
-
-In dynamically typed interfaces, it is up to developers to enforce type
-invariances. This document declares some of these invariants, so that users
-know what they can expect from a query in DataFusion, and DataFusion developers
-know what they need to enforce at the coding level.
-
-## Notation
-
-- Field or physical field: the tuple name, `arrow::DataType` and nullability flag (a bool whether values can be null), represented in this document by `PF(name, type, nullable)`
-- Logical field: Field with a relation name. Represented in this document by `LF(relation, name, type, nullable)`
-- Projected plan: plan with projection as the root node.
-- Logical schema: a vector of logical fields, used by logical plan.
-- Physical schema: a vector of physical fields, used by both physical plan and Arrow record batch.
-
-### Logical
-
-#### Function
-
-An object that knows its valid incoming logical fields and how to derive its
-output logical field from its arguments' logical fields. A functions' output
-field is itself a function of its input fields:
-
-```
-logical_field(lf1: LF, lf2: LF, ...) -> LF
-```
-
-Examples:
-
-- `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where d is the function mapping input types to output type (`get_supertype` in our current implementation).
-- `length(a) -> LF(None, "length({a})", u32, a.nullable)`
-
-#### Plan
-
-A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
-that knows how to derive its schema.
-
-Certain plans have a frozen schema (e.g. Scan), while others derive their
-schema from their child nodes.
-
-#### Column
-
-An identifier in a logical plan consists of field name and relation name.
-
-### Physical
-
-#### Function
-
-An object that knows how to derive its physical field from its arguments'
-physical fields, and also how to actually perform the computation on data. A
-functions' output physical field is a function of its input physical fields:
-
-```
-physical_field(PF1, PF2, ...) -> PF
-```
-
-Examples:
-
-- `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where d is a complex function (`get_supertype` in our current implementation) whose computation is for each element in the columns, sum the two entries together and return it in the same type as the smallest type of both columns.
-- `length(&str) -> PF("length({a})", u32, a.nullable)` whose computation is "count number of bytes in the string".
-
-#### Plan
-
-A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
-that knows how to derive its metadata and compute itself.
-
-Note how the physical plane does not know how to derive field names: field
-names are solely a property of the logical plane, as they are not needed in the
-physical plane.
-
-#### Column
-
-A type of physical node in a physical plan consists of a field name and unique index.
-
-### Data Sources' registry
-
-A map of source name/relation -> Schema plus associated properties necessary to read data from it (e.g. file path).
-
-### Functions' registry
-
-A map of function name -> logical + physical function.
-
-### Physical Planner
-
-A function that knows how to derive a physical plan from a logical plan:
-
-```
-plan(LogicalPlan) -> PhysicalPlan
-```
-
-### Logical Optimizer
-
-A function that accepts a logical plan and returns an (optimized) logical plan
-which computes the same results, but in a more efficient manner:
-
-```
-optimize(LogicalPlan) -> LogicalPlan
-```
-
-### Physical Optimizer
-
-A function that accepts a physical plan and returns an (optimized) physical
-plan which computes the same results, but may differ based on the actual
-hardware or execution environment being run:
-
-```
-optimize(PhysicalPlan) -> PhysicalPlan
-```
-
-### Builder
-
-A function that knows how to build a new logical plan from an existing logical
-plan and some extra parameters.
-
-```
-build(logical_plan, params...) -> logical_plan
-```
-
-## Invariants
-
-The following subsections describe invariants. Since functions' output schema
-depends on its arguments' schema (e.g. min, plus), the resulting schema can
-only be derived based on a known set of input schemas (TableProvider).
-Likewise, schemas of functions depend on the specific registry of functions
-registered (e.g. does `my_op` return u32 or u64?). Thus, in this section, the
-wording "same schema" is understood to mean "same schema under a given registry
-of data sources and functions".
-
-### (relation, name) tuples in logical fields and logical columns are unique
-
-Every logical field's (relation, name) tuple in a logical schema MUST be unique.
-Every logical column's (relation, name) tuple in a logical plan MUST be unique.
-
-This invariant guarantees that `SELECT t1.id, t2.id FROM t1 JOIN t2...`
-unambiguously selects the field `t1.id` and `t2.id` in a logical schema in the
-logical plane.
-
-#### Responsibility
-
-It is the logical builder and optimizer's responsibility to guarantee this
-invariant.
-
-#### Validation
-
-Builder and optimizer MUST error if this invariant is violated on any logical
-node that creates a new schema (e.g. scan, projection, aggregation, join, etc.).
-
-### Physical schema is consistent with data
-
-The contents of every Array in every RecordBatch in every partition returned by
-a physical plan MUST be consistent with the RecordBatch's schema, in that every
-Array in the RecordBatch must be downcastable to the corresponding type
-declared in that schema.
-
-#### Responsibility
-
-Physical functions MUST guarantee this invariant. This is particularly
-important in aggregate functions, whose output type may differ from the
-intermediate types used during calculation (e.g. sum(i32) -> i64).
-
-#### Validation
-
-Since the validation of this invariant is computationally expensive, execution
-contexts CAN validate this invariant. It is acceptable for physical nodes to
-`panic!` if their input does not satisfy this invariant.
-
-### Physical schema is consistent in physical functions
-
-The schema of every Array returned by a physical function MUST match the
-DataType reported by the physical function itself.
-
-This ensures that when a physical function claims that it returns a type
-(e.g. Int32), users can safely downcast its resulting Array to the
-corresponding type (e.g. Int32Array), as well as write data to formats that
-have a schema with a nullability flag (e.g. Parquet).
-
-#### Responsibility
-
-It is the responsibility of the developer that writes a physical function to
-guarantee this invariant.
-
-In particular:
-
-- The derived DataType matches the code it uses to build the array for every
branch of valid input type combinations.
-- The nullability flag matches how the values are built.
-
-#### Validation
-
-Since the validation of this invariant is computationally expensive, execution
-contexts CAN validate this invariant.
-
-### The physical schema is invariant under planning
-
-The physical schema derived by a physical plan returned by the planner MUST be
-equivalent to the physical schema derived by the logical plan passed to the
-planner. Specifically:
-
-```
-plan(logical_plan).schema === logical_plan.physical_schema
-```
-
-A logical plan's physical schema is defined as its logical schema with relation
-qualifiers stripped from all logical fields:
-
-```
-logical_plan.physical_schema = vector[ strip_relation(f) for f in logical_plan.logical_fields ]
-```
-
-This is used to ensure that the physical schema derived from a (logical) plan
-is the schema users actually get in record batches, so that users can rely on
-the optimized logical plan to know the resulting physical schema.
-
-Note that since a logical plan can be as simple as a single projection with a
-single function, `Projection f(c1,c2)`, a corollary of this is that the
-physical schema of every `logical function -> physical function` must be
-invariant under planning.
-
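The qualifier-stripping derivation above can be sketched as follows (an illustrative model, not the real implementation):

```python
# Illustrative model of physical-schema derivation: drop the relation
# qualifier from each (relation, name) logical field, keeping only the name.
def physical_schema(logical_fields):
    return [name for (_relation, name) in logical_fields]

# Two fields that are distinct logically (t1.id vs t2.id) collapse to the
# same unqualified physical name.
assert physical_schema([("t1", "id"), ("t2", "id")]) == ["id", "id"]
```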
-#### Responsibility
-
-Developers of physical and logical plans and planners MUST guarantee this
-invariant for every triplet (logical plan, physical plan, conversion rule).
-
-#### Validation
-
-Planners MUST validate this invariant. In particular, they MUST return an error
-when, during planning, a physical function's derived schema does not match the
-logical function's derived schema.
-
-### The output schema equals the physical plan schema
-
-The schema of every RecordBatch in every partition outputted by a physical plan
-MUST be equal to the schema of the physical plan. Specifically:
-
-```
-physical_plan.evaluate(batch).schema = physical_plan.schema
-```
-
-Together with other invariants, this ensures that the consumers of record
-batches do not need to know the output schema of the physical plan; they can
-safely rely on the record batch's schema to perform downcasting and naming.
-
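A minimal sketch of such a validation, with assumed names and a simplified schema representation rather than Ballista's real types:

```python
# Assumed, simplified representation: a plan schema and each batch schema are
# plain lists of (name, data_type) tuples; real code would compare Arrow schemas.
def validate_output_schema(plan_schema, batch_schemas):
    for i, batch_schema in enumerate(batch_schemas):
        if batch_schema != plan_schema:
            raise ValueError(
                f"batch {i}: schema {batch_schema} != plan schema {plan_schema}")

plan_schema = [("id", "Int32"), ("a", "Utf8")]
validate_output_schema(plan_schema, [plan_schema, plan_schema])  # consistent
```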
-#### Responsibility
-
-Physical nodes MUST guarantee this invariant.
-
-#### Validation
-
-Execution Contexts CAN validate this invariant.
-
-### Logical schema is invariant under logical optimization
-
-The logical schema derived by a projected logical plan returned by the logical
-optimizer MUST be equivalent to the logical schema derived by the logical plan
-passed to the optimizer:
-
-```
-optimize(logical_plan).schema === logical_plan.schema
-```
-
-This is used to ensure that plans can be optimized without jeopardizing future
-references to logical columns (name and index) or assumptions about their
-schemas.
-
-#### Responsibility
-
-Logical optimizers MUST guarantee this invariant.
-
-#### Validation
-
-Users of logical optimizers SHOULD validate this invariant.
-
-### Physical schema is invariant under physical optimization
-
-The physical schema derived by a projected physical plan returned by the
-physical optimizer MUST match the physical schema derived by the physical plan
-passed to the optimizer:
-
-```
-optimize(physical_plan).schema === physical_plan.schema
-```
-
-This is used to ensure that plans can be optimized without jeopardizing future
-references to logical columns (name and index) or assumptions about their
-schemas.
-
-#### Responsibility
-
-Optimizers MUST guarantee this invariant.
-
-#### Validation
-
-Users of optimizers SHOULD validate this invariant.
diff --git a/docs/source/specification/output-field-name-semantic.md
b/docs/source/specification/output-field-name-semantic.md
deleted file mode 100644
index c8665734..00000000
--- a/docs/source/specification/output-field-name-semantic.md
+++ /dev/null
@@ -1,212 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# DataFusion output field name semantic
-
-This specification documents how field names in output record batches should be
-generated based on given user queries. The field name rules apply to
-DataFusion queries planned from both SQL queries and DataFrame APIs.
-
-## Field name rules
-
-- All bare column field names MUST NOT contain a relation/table qualifier.
- - Both `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD
result in field name: `id`
-- All compound column field names MUST contain relation/table qualifier.
- - `SELECT foo + bar` SHOULD result in field name: `table.foo PLUS table.bar`
-- Function names MUST be converted to lowercase.
- - `SELECT AVG(c1)` SHOULD result in field name: `avg(table.c1)`
-- Literal strings MUST NOT be wrapped in quotes or double quotes.
- - `SELECT 'foo'` SHOULD result in field name: `foo`
-- Operator expressions MUST be wrapped with parentheses.
- - `SELECT -2` SHOULD result in field name: `(- 2)`
-- Operator and operand MUST be separated by spaces.
- - `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
-- Function arguments MUST be separated by a comma `,` and a space.
- - `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"),
col("c2")])])` SHOULD result in field name: `f(table.c1, table.c2)`
-
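A toy sketch of a few of these rules (hypothetical helpers, not the DataFusion implementation):

```python
# Hypothetical helpers illustrating the naming rules above; not DataFusion code.
def function_field_name(func, args):
    # Function names are lowercased; arguments joined by a comma and a space.
    return f"{func.lower()}({', '.join(args)})"

def binary_op_field_name(left, op, right):
    # Operator expressions are parenthesized, with spaces around the operator.
    return f"({left} {op} {right})"

assert function_field_name("AVG", ["table.c1"]) == "avg(table.c1)"
assert function_field_name("f", ["table.c1", "table.c2"]) == "f(table.c1, table.c2)"
assert binary_op_field_name("1", "+", "2") == "(1 + 2)"
```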
-## Appendices
-
-### Examples and comparison with other systems
-
-Data schema for test sample queries:
-
-```
-CREATE TABLE t1 (id INT, a VARCHAR(5));
-INSERT INTO t1 (id, a) VALUES (1, 'foo');
-INSERT INTO t1 (id, a) VALUES (2, 'bar');
-
-CREATE TABLE t2 (id INT, b VARCHAR(5));
-INSERT INTO t2 (id, b) VALUES (1, 'hello');
-INSERT INTO t2 (id, b) VALUES (2, 'world');
-```
-
-#### Projected columns
-
-Query:
-
-```
-SELECT t1.id, a, t2.id, b
-FROM t1
-JOIN t2 ON t1.id = t2.id
-```
-
-DataFusion Arrow record batches output:
-
-| id | a | id | b |
-| --- | --- | --- | ----- |
-| 1 | foo | 1 | hello |
-| 2 | bar | 2 | world |
-
-Spark, MySQL 8 and PostgreSQL 13 output:
-
-| id | a | id | b |
-| --- | --- | --- | ----- |
-| 1 | foo | 1 | hello |
-| 2 | bar | 2 | world |
-
-SQLite 3 output:
-
-| id | a | b |
-| --- | --- | ----- |
-| 1 | foo | hello |
-| 2 | bar | world |
-
-#### Function transformed columns
-
-Query:
-
-```
-SELECT ABS(t1.id), abs(-id) FROM t1;
-```
-
-DataFusion Arrow record batches output:
-
-| abs(t1.id) | abs((- t1.id)) |
-| ---------- | -------------- |
-| 1 | 1 |
-| 2 | 2 |
-
-Spark output:
-
-| abs(id) | abs((- id)) |
-| ------- | ----------- |
-| 1 | 1 |
-| 2 | 2 |
-
-MySQL 8 output:
-
-| ABS(t1.id) | abs(-id) |
-| ---------- | -------- |
-| 1 | 1 |
-| 2 | 2 |
-
-PostgreSQL 13 output:
-
-| abs | abs |
-| --- | --- |
-| 1 | 1 |
-| 2 | 2 |
-
-SQLite 3 output:
-
-| ABS(t1.id) | abs(-id) |
-| ---------- | -------- |
-| 1 | 1 |
-| 2 | 2 |
-
-#### Function with operators
-
-Query:
-
-```
-SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
-```
-
-DataFusion Arrow record batches output:
-
-| t1.id + abs(t1.id) | abs(t1.id \* t1.id) |
-| ------------------ | ------------------- |
-| 2 | 1 |
-| 4 | 4 |
-
-Spark output:
-
-| id + abs(id) | abs(id \* id) |
-| ------------ | ------------- |
-| 2 | 1 |
-| 4 | 4 |
-
-MySQL 8 output:
-
-| t1.id + ABS(id) | ABS(id \* t1.id) |
-| --------------- | ---------------- |
-| 2 | 1 |
-| 4 | 4 |
-
-PostgreSQL output:
-
-| ?column? | abs |
-| -------- | --- |
-| 2 | 1 |
-| 4 | 4 |
-
-SQLite output:
-
-| t1.id + ABS(id) | ABS(id \* t1.id) |
-| --------------- | ---------------- |
-| 2 | 1 |
-| 4 | 4 |
-
-#### Project literals
-
-Query:
-
-```
-SELECT 1, 2+5, 'foo_bar';
-```
-
-DataFusion Arrow record batches output:
-
-| 1 | (2 + 5) | foo_bar |
-| --- | ------- | ------- |
-| 1 | 7 | foo_bar |
-
-Spark output:
-
-| 1 | (2 + 5) | foo_bar |
-| --- | ------- | ------- |
-| 1 | 7 | foo_bar |
-
-MySQL output:
-
-| 1 | 2+5 | foo_bar |
-| --- | --- | ------- |
-| 1 | 7 | foo_bar |
-
-PostgreSQL output:
-
-| ?column? | ?column? | ?column? |
-| -------- | -------- | -------- |
-| 1 | 7 | foo_bar |
-
-SQLite 3 output:
-
-| 1 | 2+5 | 'foo_bar' |
-| --- | --- | --------- |
-| 1 | 7 | foo_bar |
diff --git a/docs/source/specification/quarterly_roadmap.md
b/docs/source/specification/quarterly_roadmap.md
deleted file mode 100644
index 94c7dd9e..00000000
--- a/docs/source/specification/quarterly_roadmap.md
+++ /dev/null
@@ -1,90 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Roadmap
-
-A quarterly roadmap will be published to give the DataFusion community
visibility into the priorities of the project's contributors. This roadmap is
not binding.
-
-## 2022 Q2
-
-### DataFusion Core
-
-- IO Improvements
- - Reading, registering, and writing more file formats from both DataFrame
API and SQL
- - Additional options for IO including partitioning and metadata support
-- Work Scheduling
- - Improve predictability, observability and performance of IO and CPU-bound
work
- - Develop a more explicit story for managing parallelism during plan
execution
-- Memory Management
- - Add more operators for memory limited execution
-- Performance
- - Incorporate row-format into operators such as aggregate
- - Add row-format benchmarks
- - Explore JIT-compiling complex expressions
- - Explore LLVM for JIT, with inline Rust functions as the primary goal
- - Improve performance of Sort and Merge using Row Format / JIT expressions
-- Documentation
- - General improvements to DataFusion website
- - Publish design documents
-- Streaming
- - Create `StreamProvider` trait
-
-### Ballista
-
-- Make production ready
- - Shuffle file cleanup
- - Fill functional gaps between DataFusion and Ballista
- - Improve task scheduling and data exchange efficiency
- - Better error handling
- - Task failure
- - Executor lost
- - Schedule restart
- - Improve monitoring and logging
- - Auto scaling support
-- Support for multi-scheduler deployments. Initially for resiliency and fault
tolerance but ultimately to support sharding for scalability and more efficient
caching.
-- Executor deployment grouping based on resource allocation
-
-### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))
-
-#### [DataFusion-Python](https://github.com/datafusion-contrib/datafusion-python)
-
-- Add missing functionality to DataFrame and SessionContext
-- Improve documentation
-
-#### [DataFusion-S3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-
-- Create Python bindings to use with datafusion-python
-
-#### [DataFusion-Tui](https://github.com/datafusion-contrib/datafusion-tui)
-
-- Create multiple SQL editors
-- Expose more Context and query metadata
-- Support new data sources
- - BigTable, HDFS, HTTP APIs
-
-#### [DataFusion-BigTable](https://github.com/datafusion-contrib/datafusion-bigtable)
-
-- Python binding to use with datafusion-python
-- Timestamp range predicate pushdown
-- Multi-threaded partition aware execution
-- Production ready Rust SDK
-
-#### [DataFusion-Streams](https://github.com/datafusion-contrib/datafusion-streams)
-
-- Create experimental implementation of `StreamProvider` trait
diff --git a/docs/source/specification/rfcs/template.md
b/docs/source/specification/rfcs/template.md
deleted file mode 100644
index a6f79fe9..00000000
--- a/docs/source/specification/rfcs/template.md
+++ /dev/null
@@ -1,58 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-Feature Name:
-
-Status: draft/in-progress/completed/
-
-Start Date: YYYY-MM-DD
-
-Authors:
-
-RFC PR: #
-
-DataFusion Issue: #
-
----
-
-### Background
-
----
-
-### Goals
-
----
-
-### Non-Goals
-
----
-
-### Survey
-
----
-
-### General design
-
----
-
-### Detailed design
-
----
-
-### Others
diff --git a/docs/source/specification/roadmap.md
b/docs/source/specification/roadmap.md
deleted file mode 100644
index 76b2896a..00000000
--- a/docs/source/specification/roadmap.md
+++ /dev/null
@@ -1,118 +0,0 @@
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Roadmap
-
-This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.
-
-DataFusion and Ballista are part of the [Apache
-Arrow](https://arrow.apache.org/) project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a GitHub issue or the
[email protected] mailing list to make review efficient and avoid
-surprises.
-
-# DataFusion
-
-DataFusion's goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-[Rust](https://www.rust-lang.org/) and [Apache
Arrow](https://arrow.apache.org/)
-to provide:
-
-1. Best-in-class single node query performance
-2. A Declarative SQL query interface compatible with PostgreSQL
-3. A Dataframe API, similar to those offered by Pandas and Spark
-4. A Procedural API for programmatically creating and running execution plans
-5. High performance, data race free, ergonomic extensibility points at every layer
-
-## Additional SQL Language Features
-
-- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
-- Complete support list on
[status](https://github.com/apache/arrow-datafusion/blob/master/README.md#status)
-- Timestamp Arithmetic
[#194](https://github.com/apache/arrow-datafusion/issues/194)
-- SQL Parser extension point
[#533](https://github.com/apache/arrow-datafusion/issues/533)
-- Support for nested structures (fields, lists, structs)
[#119](https://github.com/apache/arrow-datafusion/issues/119)
-- Run all queries from the TPCH benchmark (see
[milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more
details)
-
-## Query Optimizer
-
-- More sophisticated cost based optimizer for join ordering
-- Implement advanced query optimization framework (Tokomak) #440
-- Finer optimizations for group by and aggregate functions
-
-## Datasources
-
-- Better support for reading data from remote filesystems (e.g. S3) without
caching it locally
[#907](https://github.com/apache/arrow-datafusion/issues/907)
[#1060](https://github.com/apache/arrow-datafusion/issues/1060)
-- Improve performances of file format datasources (parallelize file listings,
async Arrow readers, file chunk prefetching capability...)
-
-## Runtime / Infrastructure
-
-- Migrate to some sort of arrow2 based implementation (see
[milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more
details)
-- Add DataFusion to h2oai/db-benchmark
[147](https://github.com/apache/arrow-datafusion/issues/147)
-- Improve build time
[348](https://github.com/apache/arrow-datafusion/issues/348)
-
-## Resource Management
-
-- Finer grain control and limit of runtime memory
[#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage
[#64](https://github.com/apache/arrow-datafusion/issues/64)
-
-## Python Interface
-
-TBD
-
-## DataFusion CLI (`datafusion-cli`)
-
-Note: There are some additional thoughts on a datafusion-cli vision on
[#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
-
-- Better abstraction between REPL parsing and queries so that commands are
separated and handled correctly
-- Connect to the `Statistics` subsystem and have the cli print out more stats
for query debugging, etc.
-- Improved error handling for interactive use and shell scripting usage
-- publishing to apt, brew, and possible NuGet registry so that people can use
it more easily
-- adopt a shorter name, like dfcli?
-
-# Ballista
-
-Ballista is a distributed compute platform based on Apache Arrow and
DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for
execution across the available executors
-in the cluster.
-
-Having Ballista as part of the DataFusion codebase helps ensure that
DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be
serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other
than Rust.
-
-## Ballista Roadmap
-
-## Move query scheduler into DataFusion
-
-The Ballista scheduler has some advantages over DataFusion query execution
because it doesn't try to eagerly execute
-the entire query at once but breaks it down into a directed acyclic graph
 (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to
push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores
in-process and across nodes in a cluster.
-
-## Implement execution-time cost-based optimizations based on statistics
-
-After the execution of a query stage, accurate statistics are available for
the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution.
For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some
cases we cannot predict which side will
-be smaller until execution time.
diff --git a/docs/source/user-guide/distributed/deployment/cargo-install.md
b/docs/source/user-guide/deployment/cargo-install.md
similarity index 100%
rename from docs/source/user-guide/distributed/deployment/cargo-install.md
rename to docs/source/user-guide/deployment/cargo-install.md
diff --git a/docs/source/user-guide/distributed/deployment/configuration.md
b/docs/source/user-guide/deployment/configuration.md
similarity index 100%
rename from docs/source/user-guide/distributed/deployment/configuration.md
rename to docs/source/user-guide/deployment/configuration.md
diff --git a/docs/source/user-guide/distributed/deployment/docker-compose.md
b/docs/source/user-guide/deployment/docker-compose.md
similarity index 100%
rename from docs/source/user-guide/distributed/deployment/docker-compose.md
rename to docs/source/user-guide/deployment/docker-compose.md
diff --git a/docs/source/user-guide/distributed/deployment/docker.md
b/docs/source/user-guide/deployment/docker.md
similarity index 100%
rename from docs/source/user-guide/distributed/deployment/docker.md
rename to docs/source/user-guide/deployment/docker.md
diff --git a/docs/source/user-guide/distributed/deployment/index.rst
b/docs/source/user-guide/deployment/index.rst
similarity index 98%
rename from docs/source/user-guide/distributed/deployment/index.rst
rename to docs/source/user-guide/deployment/index.rst
index f5e41d01..ad9c0714 100644
--- a/docs/source/user-guide/distributed/deployment/index.rst
+++ b/docs/source/user-guide/deployment/index.rst
@@ -25,5 +25,4 @@ Start a Ballista Cluster
docker
docker-compose
kubernetes
- raspberrypi
configuration
diff --git a/docs/source/user-guide/distributed/deployment/kubernetes.md
b/docs/source/user-guide/deployment/kubernetes.md
similarity index 100%
rename from docs/source/user-guide/distributed/deployment/kubernetes.md
rename to docs/source/user-guide/deployment/kubernetes.md
diff --git a/docs/source/user-guide/distributed/clients/cli.rst
b/docs/source/user-guide/distributed/clients/cli.rst
deleted file mode 100644
index d5cf30b8..00000000
--- a/docs/source/user-guide/distributed/clients/cli.rst
+++ /dev/null
@@ -1,111 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-=======================
-Ballista Command-line
-=======================
-
-The Arrow Ballista CLI is a command-line interactive SQL utility that allows
-queries to be executed against CSV and Parquet files. It is a convenient way to
-try Ballista out with your own data sources.
-
-Install and run using Cargo
-===========================
-
-The easiest way to get started with the Ballista CLI is via `cargo install ballista-cli`.
-
-Run using Docker
-================
-
-There is no officially published Docker image for the Ballista CLI, so it is
necessary to build from source
-instead.
-
-Use the following commands to clone this repository and build a Docker image
containing the CLI tool. Note that there is :code:`.dockerignore` file in the
root of the repository that may need to be deleted in order for this to work.
-
-.. code-block:: bash
-
- git clone https://github.com/apache/arrow-datafusion
- cd arrow-datafusion
- git checkout 8.0.0
- docker build -f ballista-cli/Dockerfile . --tag ballista-cli
- docker run -it -v $(your_data_location):/data ballista-cli
-
-
-Usage
-=====
-
-.. code-block:: bash
-
- Apache Arrow <[email protected]>
- Command Line Client for Ballista distributed query engine.
-
- USAGE:
- ballista-cli [OPTIONS]
-
- OPTIONS:
- -c, --batch-size <BATCH_SIZE> The batch size of each query, or use
DataFusion default
- -f, --file <FILE>... Execute commands from file(s), then
exit
- --format <FORMAT> [default: table] [possible values:
csv, tsv, table, json,
- nd-json]
- -h, --help Print help information
- --host <HOST> Ballista scheduler host
- -p, --data-path <DATA_PATH> Path to your data, default to current
directory
- --port <PORT> Ballista scheduler port
- -q, --quiet Reduce printing other than the
results and work quietly
- -r, --rc <RC>... Run the provided files on startup
instead of ~/.datafusionrc
- -V, --version Print version information
-
-Type `exit` or `quit` to exit the CLI.
-
-
-Registering Parquet Data Sources
-================================
-
-Parquet data sources can be registered by executing a :code:`CREATE EXTERNAL
TABLE` SQL statement. It is not necessary to provide schema information for
Parquet files.
-
-.. code-block:: sql
-
- CREATE EXTERNAL TABLE taxi
- STORED AS PARQUET
- LOCATION '/mnt/nyctaxi/tripdata.parquet';
-
-
-Registering CSV Data Sources
-============================
-
-CSV data sources can be registered by executing a :code:`CREATE EXTERNAL
TABLE` SQL statement. It is necessary to provide schema information for CSV
files since DataFusion does not automatically infer the schema when using SQL
to query CSV files.
-
-.. code-block:: sql
-
- CREATE EXTERNAL TABLE test (
- c1 VARCHAR NOT NULL,
- c2 INT NOT NULL,
- c3 SMALLINT NOT NULL,
- c4 SMALLINT NOT NULL,
- c5 INT NOT NULL,
- c6 BIGINT NOT NULL,
- c7 SMALLINT NOT NULL,
- c8 INT NOT NULL,
- c9 BIGINT NOT NULL,
- c10 VARCHAR NOT NULL,
- c11 FLOAT NOT NULL,
- c12 DOUBLE NOT NULL,
- c13 VARCHAR NOT NULL
- )
- STORED AS CSV
- WITH HEADER ROW
- LOCATION '/path/to/aggregate_test_100.csv';
diff --git a/docs/source/user-guide/distributed/clients/index.rst
b/docs/source/user-guide/distributed/clients/index.rst
deleted file mode 100644
index 6199bca5..00000000
--- a/docs/source/user-guide/distributed/clients/index.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-Clients
-=======
-
-.. toctree::
- :maxdepth: 2
-
- cli
- rust
- python
diff --git a/docs/source/user-guide/distributed/deployment/raspberrypi.md
b/docs/source/user-guide/distributed/deployment/raspberrypi.md
deleted file mode 100644
index 3bf36c72..00000000
--- a/docs/source/user-guide/distributed/deployment/raspberrypi.md
+++ /dev/null
@@ -1,129 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Running Ballista on Raspberry Pi
-
-The Raspberry Pi single-board computer provides a fun and relatively
inexpensive way to get started with distributed
-computing.
-
-These instructions have been tested using an Ubuntu Linux desktop as the host,
and a
-[Raspberry Pi 4 Model
B](https://www.raspberrypi.org/products/raspberry-pi-4-model-b/) with 4 GB RAM
as the target.
-
-## Preparing the Raspberry Pi
-
-We recommend installing the 64-bit version of [Ubuntu for Raspberry
Pi](https://ubuntu.com/raspberry-pi).
-
-The Rust implementation of Arrow does not work correctly on 32-bit ARM
architectures
-([issue](https://github.com/apache/arrow-rs/issues/109)).
-
-## Cross Compiling DataFusion for the Raspberry Pi
-
-We do not yet publish official Docker images as part of the release process,
although we do plan to do this in the
-future ([issue #228](https://github.com/apache/arrow-datafusion/issues/228)).
-
-Although it is technically possible to build DataFusion directly on a
Raspberry Pi, it really isn't very practical.
-It is much faster to use [cross](https://github.com/rust-embedded/cross) to
cross-compile from a more powerful
-desktop computer.
-
-Docker must be installed and the Docker daemon must be running before
cross-compiling with cross. See the
-[cross](https://github.com/rust-embedded/cross) project for more detailed
instructions.
-
-Run the following command to install cross.
-
-```bash
-cargo install cross
-```
-
-From the root of the DataFusion project, run the following command to
cross-compile for ARM 64 architecture.
-
-```bash
-cross build --release --target aarch64-unknown-linux-gnu
-```
-
-It is even possible to cross-test from your desktop computer:
-
-```bash
-cross test --target aarch64-unknown-linux-gnu
-```
-
-## Deploying the binaries to Raspberry Pi
-
-You should now be able to copy the executable to the Raspberry Pi using scp on
Linux. You will need to change the IP
-address in these commands to be the IP address for your Raspberry Pi. The
easiest way to find this is to connect a
-keyboard and monitor to the Pi and run `ifconfig`.
-
-```bash
-scp ./target/aarch64-unknown-linux-gnu/release/ballista-scheduler
[email protected]:
-scp ./target/aarch64-unknown-linux-gnu/release/ballista-executor
[email protected]:
-```
-
-Finally, ssh into the Pi and make the binaries executable:
-
-```bash
-ssh [email protected]
-chmod +x ballista-scheduler ballista-executor
-```
-
-It is now possible to run the Ballista scheduler and executor natively on the
Pi.
-
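For reference, launching the two processes on the Pi might look like the following. This is a minimal sketch only: the exact flag names are assumptions and depend on the Ballista version, so verify them with `--help` on each binary.

```bash
# start the scheduler (listens on port 50050 by default)
./ballista-scheduler &

# start an executor and point it at the scheduler
# (flag names are assumptions -- verify with ./ballista-executor --help)
./ballista-executor --scheduler-host localhost --scheduler-port 50050
```
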
-## Docker
-
-Using Docker's `buildx` cross-platform functionality, we can also build a
docker image targeting ARM64
-from any desktop environment. This will require write access to a Docker
repository
-on [Docker Hub](https://hub.docker.com/) because the resulting Docker image
will be pushed directly
-to the repo.
-
-```bash
-DOCKER_REPO=myrepo ./dev/build-ballista-docker-arm64.sh
-```
-
-On the Raspberry Pi:
-
-```bash
-docker pull myrepo/ballista-arm64
-```
-
-Run a scheduler:
-
-```bash
-docker run -it myrepo/ballista-arm64 /ballista-scheduler
-```
-
-Run an executor:
-
-```bash
-docker run -it myrepo/ballista-arm64 /ballista-executor
-```
-
-Run the benchmarks:
-
-```bash
-docker run -it myrepo/ballista-arm64 \
- /tpch benchmark datafusion --query=1 --path=/path/to/data --format=parquet \
- --concurrency=24 --iterations=1 --debug --host=ballista-scheduler
--bind-port=50050
-```
-
-Note that it will be necessary to mount appropriate volumes into the
containers and also configure networking
-so that the Docker containers can communicate with each other. This can be
achieved using Docker compose or Kubernetes.
-
-## Kubernetes
-
-With Docker images built using the instructions above, it is now possible to
deploy Ballista to a Kubernetes cluster
-running on one or more Raspberry Pi computers. Refer to the instructions in
the [Kubernetes](kubernetes.md) chapter
-for more information, and remember to change the Docker image name to
`myrepo/ballista-arm64`.
diff --git a/docs/source/user-guide/distributed/index.rst
b/docs/source/user-guide/distributed/index.rst
deleted file mode 100644
index abb3c7b1..00000000
--- a/docs/source/user-guide/distributed/index.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-Ballista Distributed Compute
-============================
-
-.. toctree::
- :maxdepth: 2
-
- introduction
- deployment/index
- clients/index
diff --git a/docs/source/user-guide/distributed/introduction.md
b/docs/source/user-guide/distributed/introduction.md
deleted file mode 100644
index 77db6261..00000000
--- a/docs/source/user-guide/distributed/introduction.md
+++ /dev/null
@@ -1,50 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Overview
-
-Ballista is a distributed compute platform primarily implemented in Rust, and
powered by Apache Arrow. It is
-built on an architecture that allows other programming languages to be
supported as first-class citizens without paying
-a penalty for serialization costs.
-
-The foundational technologies in Ballista are:
-
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels
for efficient processing of data.
-- [Apache Arrow Flight
Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
for efficient data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers)
for serializing query plans.
-- [DataFusion](https://github.com/apache/arrow-datafusion/) for query
execution.
-
-## How does this compare to Apache Spark?
-
-Although Ballista is largely inspired by Apache Spark, there are some key
differences.
-
-- The choice of Rust as the main execution language means that memory usage is
deterministic and avoids the overhead
- of GC pauses.
-- Ballista is designed from the ground up to use columnar data, enabling a
number of efficiencies such as vectorized
- processing (SIMD and GPU) and efficient compression. Although Spark does
have some columnar support, it is still
- largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and
memory usage can be 5x - 10x lower than
- Apache Spark in some cases, which means that more processing can fit on a
single node, reducing the overhead of
- distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that
data can be exchanged between executors
- in any programming language with minimal serialization overhead.
-
-## Status
-
-Ballista is still in the early stages of development but is capable of
executing complex analytical queries at scale.
diff --git a/docs/source/user-guide/example-usage.md
b/docs/source/user-guide/example-usage.md
deleted file mode 100644
index de8cae29..00000000
--- a/docs/source/user-guide/example-usage.md
+++ /dev/null
@@ -1,79 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Example Usage
-
-## Update `Cargo.toml`
-
-Add the following to your `Cargo.toml` file:
-
-```toml
-datafusion = "8.0.0"
-tokio = "1.0"
-```
-
-## Run a SQL query against data stored in a CSV:
-
-```rust
-use datafusion::prelude::*;
-
-#[tokio::main]
-async fn main() -> datafusion::error::Result<()> {
- // register the table
- let ctx = SessionContext::new();
- ctx.register_csv("example", "tests/example.csv",
CsvReadOptions::new()).await?;
-
- // create a plan to run a SQL query
- let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT
100").await?;
-
- // execute and print results
- df.show().await?;
- Ok(())
-}
-```
-
-## Use the DataFrame API to process data stored in a CSV:
-
-```rust
-use datafusion::prelude::*;
-
-#[tokio::main]
-async fn main() -> datafusion::error::Result<()> {
- // create the dataframe
- let ctx = SessionContext::new();
- let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new()).await?;
-
- let df = df.filter(col("a").lt_eq(col("b")))?
- .aggregate(vec![col("a")], vec![min(col("b"))])?;
-
- // execute and print results
- df.show_limit(100).await?;
- Ok(())
-}
-```
-
-## Output from both examples
-
-```text
-+---+--------+
-| a | MIN(b) |
-+---+--------+
-| 1 | 2 |
-+---+--------+
-```
diff --git a/docs/source/user-guide/introduction.md
b/docs/source/user-guide/introduction.md
index e1650409..77db6261 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -17,27 +17,34 @@
under the License.
-->
-# Introduction
+# Overview
-DataFusion is an extensible query execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
-in-memory format.
+Ballista is a distributed compute platform primarily implemented in Rust, and
powered by Apache Arrow. It is
+built on an architecture that allows other programming languages to be
supported as first-class citizens without paying
+a penalty for serialization costs.
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+The foundational technologies in Ballista are:
-## Use Cases
+- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels
for efficient processing of data.
+- [Apache Arrow Flight
Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
for efficient data transfer between processes.
+- [Google Protocol Buffers](https://developers.google.com/protocol-buffers)
for serializing query plans.
+- [DataFusion](https://github.com/apache/arrow-datafusion/) for query
execution.
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+## How does this compare to Apache Spark?
-## Why DataFusion?
+Although Ballista is largely inspired by Apache Spark, there are some key
differences.
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion
achieves very high performance
-- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet
and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design,
DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the
Arrow ecosystem, DataFusion can be used as the foundation for production
systems.
+- The choice of Rust as the main execution language means that memory usage is
deterministic and avoids the overhead
+ of GC pauses.
+- Ballista is designed from the ground up to use columnar data, enabling a
number of efficiencies such as vectorized
+ processing (SIMD and GPU) and efficient compression. Although Spark does
have some columnar support, it is still
+ largely row-based today.
+- The combination of Rust and Arrow provides excellent memory efficiency and
memory usage can be 5x - 10x lower than
+ Apache Spark in some cases, which means that more processing can fit on a
single node, reducing the overhead of
+ distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that
data can be exchanged between executors
+ in any programming language with minimal serialization overhead.
+
+## Status
+
+Ballista is still in the early stages of development but is capable of
executing complex analytical queries at scale.
diff --git a/docs/source/user-guide/library.md
b/docs/source/user-guide/library.md
deleted file mode 100644
index 422c9d6d..00000000
--- a/docs/source/user-guide/library.md
+++ /dev/null
@@ -1,112 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Using DataFusion as a library
-
-## Create a new project
-
-```shell
-cargo new hello_datafusion
-```
-
-```shell
-$ cd hello_datafusion
-$ tree .
-.
-├── Cargo.toml
-└── src
- └── main.rs
-
-1 directory, 2 files
-```
-
-## Default Configuration
-
-DataFusion is [published on crates.io](https://crates.io/crates/datafusion),
and is [well documented on docs.rs](https://docs.rs/datafusion/).
-
-To get started, add the following to your `Cargo.toml` file:
-
-```toml
-[dependencies]
-datafusion = "8.0.0"
-```
-
-## Create a main function
-
-Update the `main.rs` file with your first DataFusion application, based on
[Example
usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html).
-
-```rust
-use datafusion::prelude::*;
-
-#[tokio::main]
-async fn main() -> datafusion::error::Result<()> {
- // register the table
- let ctx = SessionContext::new();
- ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>",
CsvReadOptions::new()).await?;
-
- // create a plan to run a SQL query
- let df = ctx.sql("SELECT * FROM test").await?;
-
- // execute and print results
- df.show().await?;
- Ok(())
-}
-```
-
-## Optimized Configuration
-
-For an optimized build several steps are required. First, use the below in
your `Cargo.toml`. It is
-worth noting that using the settings in the `[profile.release]` section will
significantly increase the build time.
-
-```toml
-[dependencies]
-datafusion = { version = "7.0" , features = ["simd"]}
-tokio = { version = "^1.0", features = ["rt-multi-thread"] }
-snmalloc-rs = "0.2"
-
-[profile.release]
-lto = true
-codegen-units = 1
-```
-
-Then, in `main.rs`, update the memory allocator with the below after your
imports:
-
-```rust
-use datafusion::prelude::*;
-
-#[global_allocator]
-static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
-
-#[tokio::main]
-async fn main() -> datafusion::error::Result<()> {
- ...
-}
-```
-
-Finally, building with the `simd` feature requires the Rust nightly
toolchain.
-
-```shell
-rustup toolchain install nightly
-```
-
-Based on the instruction set architecture you are building for, you will want
to configure the `target-cpu` as well, ideally
-with `native` or at least `avx2`.
-
-```
-RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
-```
diff --git a/docs/source/user-guide/distributed/clients/python.md
b/docs/source/user-guide/python.md
similarity index 80%
rename from docs/source/user-guide/distributed/clients/python.md
rename to docs/source/user-guide/python.md
index dac06408..0d42c450 100644
--- a/docs/source/user-guide/distributed/clients/python.md
+++ b/docs/source/user-guide/python.md
@@ -19,4 +19,14 @@
# Python
-Coming soon.
+```text
+>>> import ballista
+>>> ctx = ballista.BallistaContext("localhost", 50050)
+>>> df = ctx.sql("SELECT 1")
+>>> df.show()
++----------+
+| Int64(1) |
++----------+
+| 1 |
++----------+
+```
diff --git a/docs/source/user-guide/distributed/clients/rust.md
b/docs/source/user-guide/rust.md
similarity index 100%
rename from docs/source/user-guide/distributed/clients/rust.md
rename to docs/source/user-guide/rust.md
diff --git a/docs/source/user-guide/sql/aggregate_functions.md
b/docs/source/user-guide/sql/aggregate_functions.md
deleted file mode 100644
index d3472a7f..00000000
--- a/docs/source/user-guide/sql/aggregate_functions.md
+++ /dev/null
@@ -1,62 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Aggregate Functions
-
-Aggregate functions operate on a set of values to compute a single result.
Please refer to
[PostgreSQL](https://www.postgresql.org/docs/current/functions-aggregate.html)
for usage of standard SQL functions.
-
-## General
-
-- min
-- max
-- count
-- avg
-- sum
-- array_agg
-
-## Statistical
-
-- var / var_samp / var_pop
-- stddev / stddev_samp / stddev_pop
-- covar / covar_samp / covar_pop
-- corr
-
-## Approximate
-
-### approx_distinct
-
-`approx_distinct(x) -> uint64` returns the approximate number (HyperLogLog) of
distinct input values
-
-### approx_median
-
-`approx_median(x) -> x` returns the approximate median of input values. It is
an alias of `approx_percentile_cont(x, 0.5)`.
-
-### approx_percentile_cont
-
-`approx_percentile_cont(x, p) -> x` returns the approximate percentile
(TDigest) of input values, where `p` is a float64 between 0 and 1 (inclusive).
-
-It supports raw data as input and builds TDigest sketches during query time,
and is approximately equal to `approx_percentile_cont_with_weight(x, 1, p)`.
-
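For intuition, the continuous percentile that `approx_percentile_cont` estimates can be computed exactly for a small sample. The Python sketch below shows the exact quantity being approximated; the SQL function uses a TDigest sketch rather than this exact method.

```python
def percentile_cont(values, p):
    """Exact continuous percentile via linear interpolation between
    closest ranks -- the quantity approx_percentile_cont estimates."""
    xs = sorted(values)
    k = (len(xs) - 1) * p          # fractional rank of the percentile
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

print(percentile_cont([1, 2, 3, 4, 5], 0.5))   # -> 3.0 (the median)
```
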
-### approx_percentile_cont_with_weight
-
-`approx_percentile_cont_with_weight(x, w, p) -> x` returns the approximate
percentile (TDigest) of input values with weight, where `w` is weight column
expression and `p` is a float64 between 0 and 1 (inclusive).
-
-It supports raw data as input or pre-aggregated TDigest sketches, then builds
or merges TDigest sketches during query time. A TDigest sketch is a list of
centroids `(x, w)`, where `x` is the mean and `w` is the weight of a centroid.
-
-It is suitable for low-latency OLAP systems where a streaming compute engine
(e.g. Spark Streaming or Flink) pre-aggregates data into a data store, which
is then queried using DataFusion.
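As an illustration, the approximate aggregates above can be combined in a single query. The table and column names here are hypothetical:

```sql
-- t is a hypothetical table with a numeric column x
SELECT approx_distinct(x),
       approx_median(x),
       approx_percentile_cont(x, 0.95)
FROM t;
```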
diff --git a/docs/source/user-guide/sql/datafusion-functions.md
b/docs/source/user-guide/sql/datafusion-functions.md
deleted file mode 100644
index e37ba11e..00000000
--- a/docs/source/user-guide/sql/datafusion-functions.md
+++ /dev/null
@@ -1,110 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# DataFusion-Specific Functions
-
-These SQL functions are either specific to DataFusion, or are well-known
functions whose behavior is specific to DataFusion. In particular, the
`to_timestamp_xx()` functions exist due to Arrow's support for multiple
timestamp resolutions.
-
-## `to_timestamp`
-
-`to_timestamp()` is similar to the standard SQL function. It performs
conversions to type `Timestamp(Nanoseconds, None)`, from:
-
-- Timestamp strings
- - `1997-01-31T09:26:56.123Z` # RFC3339
- - `1997-01-31T09:26:56.123-05:00` # RFC3339
- - `1997-01-31 09:26:56.123-05:00` # close to RFC3339 but with a space
rather than T
- - `1997-01-31T09:26:56.123` # close to RFC3339 but no timezone offset
specified
- - `1997-01-31 09:26:56.123` # close to RFC3339 but uses a space and no
timezone offset
- - `1997-01-31 09:26:56` # close to RFC3339, no fractional seconds
-- An Int64 array/column, values are nanoseconds since Epoch UTC
-- Other Timestamp() columns or values
-
-Note that conversions from other Timestamp and Int64 types can also be
performed using `CAST(.. AS Timestamp)`. However, the conversion functionality
here is present for consistency with the other `to_timestamp_xx()` functions.
-
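For example, the following two queries are equivalent ways to obtain a nanosecond-resolution timestamp (the column name is hypothetical):

```sql
-- ts_int is a hypothetical Int64 column of nanoseconds since the epoch
SELECT to_timestamp(ts_int) FROM t;
SELECT CAST(ts_int AS TIMESTAMP) FROM t;
```
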
-## `to_timestamp_millis`
-
-`to_timestamp_millis()` does conversions to type `Timestamp(Milliseconds,
None)`, from:
-
-- Timestamp strings, the same as supported by the regular timestamp() function
(except the output is a timestamp of Milliseconds resolution)
- - `1997-01-31T09:26:56.123Z` # RFC3339
- - `1997-01-31T09:26:56.123-05:00` # RFC3339
- - `1997-01-31 09:26:56.123-05:00` # close to RFC3339 but with a space
rather than T
- - `1997-01-31T09:26:56.123` # close to RFC3339 but no timezone offset
specified
- - `1997-01-31 09:26:56.123` # close to RFC3339 but uses a space and no
timezone offset
- - `1997-01-31 09:26:56` # close to RFC3339, no fractional seconds
-- An Int64 array/column, values are milliseconds since Epoch UTC
-- Other Timestamp() columns or values
-
-Note that `CAST(.. AS Timestamp)` converts to Timestamps with Nanosecond
resolution; this function is the only way to convert/cast to millisecond
resolution.
-
-## `to_timestamp_micros`
-
-`to_timestamp_micros()` does conversions to type `Timestamp(Microseconds,
None)`, from:
-
-- Timestamp strings, the same as supported by the regular timestamp() function
(except the output is a timestamp of microseconds resolution)
- - `1997-01-31T09:26:56.123Z` # RFC3339
- - `1997-01-31T09:26:56.123-05:00` # RFC3339
- - `1997-01-31 09:26:56.123-05:00` # close to RFC3339 but with a space
rather than T
- - `1997-01-31T09:26:56.123` # close to RFC3339 but no timezone offset
specified
- - `1997-01-31 09:26:56.123` # close to RFC3339 but uses a space and no
timezone offset
- - `1997-01-31 09:26:56` # close to RFC3339, no fractional seconds
-- An Int64 array/column, values are microseconds since Epoch UTC
-- Other Timestamp() columns or values
-
-Note that `CAST(.. AS Timestamp)` converts to Timestamps with Nanosecond
resolution; this function is the only way to convert/cast to microsecond
resolution.
-
-## `to_timestamp_seconds`
-
-`to_timestamp_seconds()` does conversions to type `Timestamp(Seconds, None)`,
from:
-
-- Timestamp strings, the same as supported by the regular timestamp() function
(except the output is a timestamp of seconds resolution)
- - `1997-01-31T09:26:56.123Z` # RFC3339
- - `1997-01-31T09:26:56.123-05:00` # RFC3339
- - `1997-01-31 09:26:56.123-05:00` # close to RFC3339 but with a space
rather than T
- - `1997-01-31T09:26:56.123` # close to RFC3339 but no timezone offset
specified
- - `1997-01-31 09:26:56.123` # close to RFC3339 but uses a space and no
timezone offset
- - `1997-01-31 09:26:56` # close to RFC3339, no fractional seconds
-- An Int64 array/column, values are seconds since Epoch UTC
-- Other Timestamp() columns or values
-
-Note that `CAST(.. AS Timestamp)` converts to Timestamps with Nanosecond
resolution; this function is the only way to convert/cast to seconds resolution.
-
-## `extract`
-
-`extract(field FROM source)`
-
-- The `extract` function retrieves subfields such as year or hour from
date/time values.
  `source` must be a value expression of type timestamp, Date32, or Date64.
`field` is an identifier that selects what field to extract from the source
value.
- The `extract` function returns values of type u32.
- - `year` :`extract(year FROM to_timestamp('2020-09-08T12:00:00+00:00')) ->
2020`
- - `month`:`extract(month FROM to_timestamp('2020-09-08T12:00:00+00:00')) ->
9`
- - `week` :`extract(week FROM to_timestamp('2020-09-08T12:00:00+00:00')) ->
37`
- - `day`: `extract(day FROM to_timestamp('2020-09-08T12:00:00+00:00')) -> 8`
- - `hour`: `extract(hour FROM to_timestamp('2020-09-08T12:00:00+00:00')) ->
12`
- - `minute`: `extract(minute FROM to_timestamp('2020-09-08T12:01:00+00:00'))
-> 1`
- - `second`: `extract(second FROM to_timestamp('2020-09-08T12:00:03+00:00'))
-> 3`
-
-## `date_part`
-
-`date_part('field', source)`
-
-- The `date_part` function is modeled on the PostgreSQL equivalent to the
SQL-standard function `extract`.
- Note that here the field parameter needs to be a string value, not a name.
- The valid field names for `date_part` are the same as for `extract`.
- - `date_part('second', to_timestamp('2020-09-08T12:00:12+00:00')) -> 12`
diff --git a/docs/source/user-guide/sql/ddl.md
b/docs/source/user-guide/sql/ddl.md
deleted file mode 100644
index 75ec0f6c..00000000
--- a/docs/source/user-guide/sql/ddl.md
+++ /dev/null
@@ -1,99 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# DDL
-
-## CREATE EXTERNAL TABLE
-
-Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE`
SQL statement. It is not necessary
-to provide schema information for Parquet files.
-
-```sql
-CREATE EXTERNAL TABLE taxi
-STORED AS PARQUET
-LOCATION '/mnt/nyctaxi/tripdata.parquet';
-```
-
-CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE`
SQL statement. It is necessary to
-provide schema information for CSV files since DataFusion does not
automatically infer the schema when using SQL
-to query CSV files.
-
-```sql
-CREATE EXTERNAL TABLE test (
- c1 VARCHAR NOT NULL,
- c2 INT NOT NULL,
- c3 SMALLINT NOT NULL,
- c4 SMALLINT NOT NULL,
- c5 INT NOT NULL,
- c6 BIGINT NOT NULL,
- c7 SMALLINT NOT NULL,
- c8 INT NOT NULL,
- c9 BIGINT NOT NULL,
- c10 VARCHAR NOT NULL,
- c11 FLOAT NOT NULL,
- c12 DOUBLE NOT NULL,
- c13 VARCHAR NOT NULL
-)
-STORED AS CSV
-WITH HEADER ROW
-LOCATION '/path/to/aggregate_test_100.csv';
-```
-
-If data sources are already partitioned in Hive style, `PARTITIONED BY` can be
used for partition pruning.
-
-```
-/mnt/nyctaxi/year=2022/month=01/tripdata.parquet
-/mnt/nyctaxi/year=2021/month=12/tripdata.parquet
-/mnt/nyctaxi/year=2021/month=11/tripdata.parquet
-```
-
-```sql
-CREATE EXTERNAL TABLE taxi
-STORED AS PARQUET
-PARTITIONED BY (year, month)
-LOCATION '/mnt/nyctaxi';
-```
-
-## CREATE MEMORY TABLE
-
-A memory table can be created from a query.
-
-```
-CREATE TABLE TABLE_NAME AS [SELECT | VALUES LIST]
-```
-
-```sql
-CREATE TABLE valuetable AS VALUES(1,'HELLO'),(12,'DATAFUSION');
-
-CREATE TABLE memtable AS SELECT * FROM valuetable;
-```
-
-## DROP TABLE
-
-An existing table can be deleted.
-
-```
-DROP TABLE [ IF EXISTS ] name
-```
-
-```sql
-CREATE TABLE users AS VALUES(1,2),(2,3);
-
-DROP TABLE users;
-```
diff --git a/docs/source/user-guide/sql/index.rst
b/docs/source/user-guide/sql/index.rst
deleted file mode 100644
index f6d3a0bb..00000000
--- a/docs/source/user-guide/sql/index.rst
+++ /dev/null
@@ -1,28 +0,0 @@
-.. Licensed to the Apache Software Foundation (ASF) under one
-.. or more contributor license agreements. See the NOTICE file
-.. distributed with this work for additional information
-.. regarding copyright ownership. The ASF licenses this file
-.. to you under the Apache License, Version 2.0 (the
-.. "License"); you may not use this file except in compliance
-.. with the License. You may obtain a copy of the License at
-
-.. http://www.apache.org/licenses/LICENSE-2.0
-
-.. Unless required by applicable law or agreed to in writing,
-.. software distributed under the License is distributed on an
-.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-.. KIND, either express or implied. See the License for the
-.. specific language governing permissions and limitations
-.. under the License.
-
-SQL Reference
-=============
-
-.. toctree::
- :maxdepth: 2
-
- sql_status
- select
- ddl
- aggregate_functions
- DataFusion Functions <datafusion-functions>
diff --git a/docs/source/user-guide/sql/select.md
b/docs/source/user-guide/sql/select.md
deleted file mode 100644
index 49399c93..00000000
--- a/docs/source/user-guide/sql/select.md
+++ /dev/null
@@ -1,136 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# SELECT syntax
-
-Queries in DataFusion scan data from tables and return zero or more rows.
-This section describes the SQL query syntax supported by DataFusion.
-
-DataFusion supports the following syntax for queries:
-<code class="language-sql hljs">
-
-[ [WITH](#with-clause) with_query [, ...] ] <br/>
-[SELECT](#select-clause) [ ALL | DISTINCT ] select_expr [, ...] <br/>
-[ [FROM](#from-clause) from_item [, ...] ] <br/>
-[ [WHERE](#where-clause) condition ] <br/>
-[ [GROUP BY](#group-by-clause) grouping_element [, ...] ] <br/>
-[ [HAVING](#having-clause) condition] <br/>
-[ [UNION](#union-clause) [ ALL | select ] <br/>
-[ [ORDER BY](#order-by-clause) expression [ ASC | DESC ][, ...] ] <br/>
-[ [LIMIT](#limit-clause) count ] <br/>
-
-</code>
-
-## WITH clause
-
-A WITH clause allows you to give names to queries and reference them by name.
-
-```sql
-WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
-SELECT a, b FROM x;
-```
-
-## SELECT clause
-
-Example:
-
-```sql
-SELECT a, b, a + b FROM table
-```
-
-The `DISTINCT` quantifier can be added to make the query return all distinct
rows.
-By default `ALL` will be used, which returns all the rows.
-
-```sql
-SELECT DISTINCT person, age FROM employees
-```
-
-## FROM clause
-
-Example:
-
-```sql
-SELECT t.a FROM table AS t
-```
-
-## WHERE clause
-
-Example:
-
-```sql
-SELECT a FROM table WHERE a > 10
-```
-
-## GROUP BY clause
-
-Example:
-
-```sql
-SELECT a, b, MAX(c) FROM table GROUP BY a, b
-```
-
-## HAVING clause
-
-Example:
-
-```sql
-SELECT a, b, MAX(c) FROM table GROUP BY a, b HAVING MAX(c) > 10
-```
-
-## UNION clause
-
-Example:
-
-```sql
-SELECT
- a,
- b,
- c
-FROM table1
-UNION ALL
-SELECT
- a,
- b,
- c
-FROM table2
-```
-
-## ORDER BY clause
-
-Orders the results by the referenced expression. By default it uses ascending
order (`ASC`).
-This order can be changed to descending by adding `DESC` after the order-by
expressions.
-
-Examples:
-
-```sql
-SELECT age, person FROM table ORDER BY age;
-SELECT age, person FROM table ORDER BY age DESC;
-SELECT age, person FROM table ORDER BY age, person DESC;
-```
-
-## LIMIT clause
-
-Limits the number of rows to be a maximum of `count` rows. `count` should be a
non-negative integer.
-
-Example:
-
-```sql
-SELECT age, person FROM table
-LIMIT 10
-```
diff --git a/docs/source/user-guide/sql/sql_status.md
b/docs/source/user-guide/sql/sql_status.md
deleted file mode 100644
index 8b2a3293..00000000
--- a/docs/source/user-guide/sql/sql_status.md
+++ /dev/null
@@ -1,246 +0,0 @@
-<!---
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-
-# Status
-
-## General
-
-- [x] SQL Parser
-- [x] SQL Query Planner
-- [x] Query Optimizer
-- [x] Constant folding
-- [x] Join Reordering
-- [x] Limit Pushdown
-- [x] Projection push down
-- [x] Predicate push down
-- [x] Type coercion
-- [x] Parallel query execution
-
-## SQL Support
-
-- [x] Projection
-- [x] Filter (WHERE)
-- [x] Filter post-aggregate (HAVING)
-- [x] Limit
-- [x] Aggregate
-- [x] Common math functions
-- [x] cast
-- [x] try_cast
-- [x] [`VALUES` lists](https://www.postgresql.org/docs/current/queries-values.html)
-- Postgres compatible String functions
- - [x] ascii
- - [x] bit_length
- - [x] btrim
- - [x] char_length
- - [x] character_length
- - [x] chr
- - [x] concat
- - [x] concat_ws
- - [x] initcap
- - [x] left
- - [x] length
- - [x] lpad
- - [x] ltrim
- - [x] octet_length
- - [x] regexp_replace
- - [x] repeat
- - [x] replace
- - [x] reverse
- - [x] right
- - [x] rpad
- - [x] rtrim
- - [x] split_part
- - [x] starts_with
- - [x] strpos
- - [x] substr
- - [x] to_hex
- - [x] translate
- - [x] trim
-- Miscellaneous/Boolean functions
- - [x] nullif
-- Approximation functions
- - [x] approx_distinct
- - [x] approx_median
- - [x] approx_percentile_cont
- - [x] approx_percentile_cont_with_weight
-- Common date/time functions
- - [ ] Basic date functions
- - [ ] Basic time functions
- - [x] Basic timestamp functions
- [x] [to_timestamp](docs/user-guide/book/sql/datafusion-functions.html#to_timestamp)
- [x] [to_timestamp_millis](docs/user-guide/book/sql/datafusion-functions.html#to_timestamp_millis)
- [x] [to_timestamp_micros](docs/user-guide/book/sql/datafusion-functions.html#to_timestamp_micros)
- [x] [to_timestamp_seconds](docs/user-guide/book/sql/datafusion-functions.html#to_timestamp_seconds)
- - [x] [extract](docs/user-guide/book/sql/datafusion-functions.html#extract)
- [x] [date_part](docs/user-guide/book/sql/datafusion-functions.html#date_part)
-- nested functions
- - [x] Array of columns
-- [x] Schema Queries
- - [x] SHOW TABLES
- - [x] SHOW COLUMNS
- - [x] information_schema.{tables, columns}
- - [ ] information_schema other views
-- [x] Sorting
-- [ ] Nested types
-- [ ] Lists
-- [x] Subqueries
-- [x] Common table expressions
-- [x] Set Operations
- - [x] UNION ALL
- - [x] UNION
- - [x] INTERSECT
- - [x] INTERSECT ALL
- - [x] EXCEPT
- - [x] EXCEPT ALL
-- [x] Joins
- - [x] INNER JOIN
- - [x] LEFT JOIN
- - [x] RIGHT JOIN
- - [x] FULL JOIN
- - [x] CROSS JOIN
-- [ ] Window
- - [x] Empty window
- - [x] Common window functions
- - [x] Window with PARTITION BY clause
- - [x] Window with ORDER BY clause
- - [ ] Window with FILTER clause
- [ ] [Window with custom WINDOW FRAME](https://github.com/apache/arrow-datafusion/issues/361)
- - [ ] UDF and UDAF for window functions
-
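-The window support listed above can be sketched with a short example. Assuming a
-hypothetical table `sales` with columns `region` and `amount`, a window function
-using both a `PARTITION BY` and an `ORDER BY` clause looks like:
-
-```sql
-SELECT region,
-       amount,
-       SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running_total
-FROM sales;
-```
-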
-## Data Sources
-
-- [x] CSV
-- [x] Parquet primitive types
-- [ ] Parquet nested types
-
-## Extensibility
-
-DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
-
-- [x] User Defined Functions (UDFs)
-- [x] User Defined Aggregate Functions (UDAFs)
-- [x] User Defined Table Source (`TableProvider`) for tables
-- [x] User Defined `Optimizer` passes (plan rewrites)
-- [x] User Defined `LogicalPlan` nodes
-- [x] User Defined `ExecutionPlan` nodes
-
-## Rust Version Compatibility
-
-This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
-
-# Supported SQL
-
-This library currently supports many SQL constructs, including:
-
-- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's location
-- `SELECT ... FROM ...` together with any expression
-- `ALIAS` to name an expression
-- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
-- Many mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
-- `WHERE` to filter
-- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`, `CORR`, `VAR`, `COVAR`, `STDDEV` (sample and population)
-- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
-
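-As a sketch of how these constructs combine, assuming a hypothetical table
-`sales` with columns `region`, `amount`, and `price`:
-
-```sql
-SELECT region,
-       COUNT(*) AS num_sales,
-       SUM(CAST(amount AS DOUBLE)) AS total_amount
-FROM sales
-WHERE price >= 10
-GROUP BY region
-ORDER BY total_amount DESC;
-```
-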
-## Supported Functions
-
-DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and to allow reuse of the PostgreSQL documentation and tutorials as much as possible.
-
-Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.
-
-## Schema Metadata / Information Schema Support
-
-DataFusion supports showing metadata about the available tables. This information can be accessed through the views of the ISO SQL `information_schema` schema or the DataFusion-specific `SHOW TABLES` and `SHOW COLUMNS` commands.
-
-More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html).
-
-To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:
-
-```sql
-> show tables;
-+---------------+--------------------+------------+------------+
-| table_catalog | table_schema | table_name | table_type |
-+---------------+--------------------+------------+------------+
-| datafusion | public | t | BASE TABLE |
-| datafusion | information_schema | tables | VIEW |
-+---------------+--------------------+------------+------------+
-
-> select * from information_schema.tables;
-
-+---------------+--------------------+------------+--------------+
-| table_catalog | table_schema | table_name | table_type |
-+---------------+--------------------+------------+--------------+
-| datafusion | public | t | BASE TABLE |
-| datafusion | information_schema | TABLES | SYSTEM TABLE |
-+---------------+--------------------+------------+--------------+
-```
-
-To show the schema of a table in DataFusion, use the `SHOW COLUMNS` command or the `information_schema.columns` view:
-
-```sql
-> show columns from t;
-+---------------+--------------+------------+-------------+-----------+-------------+
-| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
-+---------------+--------------+------------+-------------+-----------+-------------+
-| datafusion    | public       | t          | a           | Int32     | NO          |
-| datafusion    | public       | t          | b           | Utf8      | NO          |
-| datafusion    | public       | t          | c           | Float32   | NO          |
-+---------------+--------------+------------+-------------+-----------+-------------+
-
-> select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns;
-+------------+-------------+------------------+-------------+-----------+
-| table_name | column_name | ordinal_position | is_nullable | data_type |
-+------------+-------------+------------------+-------------+-----------+
-| t | a | 0 | NO | Int32 |
-| t | b | 1 | NO | Utf8 |
-| t | c | 2 | NO | Float32 |
-+------------+-------------+------------------+-------------+-----------+
-```
-
-## Supported Data Types
-
-DataFusion uses Arrow, and thus the Arrow type system, for query
-execution. The SQL types from
-[sqlparser-rs](https://github.com/ballista-compute/sqlparser-rs/blob/main/src/ast/data_type.rs#L57)
-are mapped to Arrow types according to the following table
-
-| SQL Data Type | Arrow DataType |
-| ------------- | --------------------------------- |
-| `CHAR` | `Utf8` |
-| `VARCHAR` | `Utf8` |
-| `UUID` | _Not yet supported_ |
-| `CLOB` | _Not yet supported_ |
-| `BINARY` | _Not yet supported_ |
-| `VARBINARY` | _Not yet supported_ |
-| `DECIMAL` | `Float64` |
-| `FLOAT` | `Float32` |
-| `SMALLINT` | `Int16` |
-| `INT` | `Int32` |
-| `BIGINT` | `Int64` |
-| `REAL` | `Float32` |
-| `DOUBLE` | `Float64` |
-| `BOOLEAN` | `Boolean` |
-| `DATE` | `Date32` |
-| `TIME` | `Time64(TimeUnit::Millisecond)` |
-| `TIMESTAMP` | `Timestamp(TimeUnit::Nanosecond)` |
-| `INTERVAL` | _Not yet supported_ |
-| `REGCLASS` | _Not yet supported_ |
-| `TEXT` | _Not yet supported_ |
-| `BYTEA` | _Not yet supported_ |
-| `CUSTOM` | _Not yet supported_ |
-| `ARRAY` | _Not yet supported_ |
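-
-For example, given the mappings above, an explicit `CAST` yields columns of the
-corresponding Arrow types (the table `t` and column `c` here are hypothetical):
-
-```sql
-SELECT CAST(c AS INT) AS c_int32,       -- Arrow Int32
-       CAST(c AS DOUBLE) AS c_float64  -- Arrow Float64
-FROM t;
-```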