This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion-python.git
The following commit(s) were added to refs/heads/main by this push:
new e34d203 enhancement(docs): Add user guide (#432) (#445)
e34d203 is described below
commit e34d203fd63760f8660adee251817df2afb542b0
Author: Daniel Mesejo <[email protected]>
AuthorDate: Thu Aug 10 02:03:34 2023 +0200
enhancement(docs): Add user guide (#432) (#445)
---
docs/requirements.txt | 4 +-
docs/source/api.rst | 1 -
docs/source/conf.py | 1 +
docs/source/contributor-guide/introduction.rst | 85 +++++++
docs/source/index.rst | 263 +++------------------
docs/source/user-guide/basics.rst | 89 +++++++
.../user-guide/common-operations/aggregations.rst | 60 +++++
.../user-guide/common-operations/basic-info.rst | 61 +++++
.../user-guide/common-operations/functions.rst | 117 +++++++++
.../common-operations/index.rst} | 20 +-
docs/source/user-guide/common-operations/joins.rst | 104 ++++++++
.../common-operations/select-and-filter.rst | 67 ++++++
.../user-guide/common-operations/udf-and-udfa.rst | 85 +++++++
.../user-guide/common-operations/windows.rst | 93 ++++++++
.../{api.rst => user-guide/introduction.rst} | 35 ++-
docs/source/{api.rst => user-guide/io/avro.rst} | 22 +-
docs/source/{api.rst => user-guide/io/csv.rst} | 28 ++-
docs/source/{api.rst => user-guide/io/index.rst} | 18 +-
docs/source/{api.rst => user-guide/io/json.rst} | 21 +-
docs/source/{api.rst => user-guide/io/parquet.rst} | 28 ++-
docs/source/{api.rst => user-guide/sql.rst} | 31 ++-
21 files changed, 913 insertions(+), 320 deletions(-)
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 8eb7449..eb7f5b7 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -19,4 +19,6 @@ sphinx==5.3.0
pydata-sphinx-theme==0.8.0
myst-parser
maturin
-jinja2
\ No newline at end of file
+jinja2
+ipython
+pandas
\ No newline at end of file
diff --git a/docs/source/api.rst b/docs/source/api.rst
index a5d6543..d9f4a09 100644
--- a/docs/source/api.rst
+++ b/docs/source/api.rst
@@ -24,7 +24,6 @@ API Reference
.. toctree::
:maxdepth: 2
- api/config
api/dataframe
api/execution_context
api/expression
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 929c249..0822e0a 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -52,6 +52,7 @@ extensions = [
"sphinx.ext.viewcode",
"sphinx.ext.napoleon",
"myst_parser",
+ "IPython.sphinxext.ipython_directive",
]
source_suffix = {
diff --git a/docs/source/contributor-guide/introduction.rst b/docs/source/contributor-guide/introduction.rst
new file mode 100644
index 0000000..dd61ad8
--- /dev/null
+++ b/docs/source/contributor-guide/introduction.rst
@@ -0,0 +1,85 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Introduction
+============
+We welcome and encourage contributions of all kinds, such as:
+
+1. Tickets with issue reports or feature requests
+2. Documentation improvements
+3. Code, both PRs and (especially) PR reviews.
+
+In addition to submitting new PRs, we have a healthy tradition of community members reviewing each other’s PRs.
+Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.
+
+How to develop
+--------------
+
+This assumes that you have rust and cargo installed. We use the workflow recommended by `pyo3 <https://github.com/PyO3/pyo3>`_ and `maturin <https://github.com/PyO3/maturin>`_.
+
+Bootstrap:
+
+.. code-block:: shell
+
+ # fetch this repo
+ git clone [email protected]:apache/arrow-datafusion-python.git
+ # prepare development environment (used to build wheel / install in development)
+ python3 -m venv venv
+ # activate the venv
+ source venv/bin/activate
+ # update pip itself if necessary
+ python -m pip install -U pip
+ # install dependencies (for Python 3.8+)
+ python -m pip install -r requirements-310.txt
+
+The tests rely on test data in git submodules.
+
+.. code-block:: shell
+
+ git submodule init
+ git submodule update
+
+
+Whenever rust code changes (your changes or via `git pull`):
+
+.. code-block:: shell
+
+ # make sure you activate the venv using "source venv/bin/activate" first
+ maturin develop
+ python -m pytest
+
+
+Update Dependencies
+-------------------
+
+To change test dependencies, change the `requirements.in` and run
+
+.. code-block:: shell
+
+ # install pip-tools (this can be done only once), also consider running in venv
+ python -m pip install pip-tools
+ python -m piptools compile --generate-hashes -o requirements-310.txt
+
+
+To update dependencies, run with `-U`
+
+.. code-block:: shell
+
+ python -m piptools compile -U --generate-hashes -o requirements-310.txt
+
+
+More details about pip-tools are available `here <https://github.com/jazzband/pip-tools>`_.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 78f44ea..155b0cf 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -31,12 +31,17 @@ Its query engine, DataFusion, is written in `Rust <https://www.rust-lang.org>`_,
 Technically, zero-copy is achieved via the `c data interface <https://arrow.apache.org/docs/format/CDataInterface.html>`_.
-How to use it
-=============
+Install
+-------
-Simple usage:
+.. code-block:: shell
+
+ pip install datafusion
-.. code-block:: python
+Example
+-------
+
+.. ipython:: python
import datafusion
from datafusion import col
@@ -50,7 +55,7 @@ Simple usage:
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
- df = ctx.create_dataframe([[batch]])
+ df = ctx.create_dataframe([[batch]], name="batch_array")
# create a new statement
df = df.select(
@@ -58,234 +63,44 @@ Simple usage:
col("a") - col("b"),
)
- # execute and collect the first (and only) batch
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([5, 7, 9])
- assert result.column(1) == pyarrow.array([-3, -3, -3])
-
-
-We can also execute a query against data stored in CSV
-
-.. code-block:: bash
-
- echo "a,b\n1,4\n2,5\n3,6" > example.csv
-
-
-.. code-block:: python
-
- import datafusion
- from datafusion import col
- import pyarrow
-
- # create a context
- ctx = datafusion.SessionContext()
-
- # register a CSV
- ctx.register_csv('example', 'example.csv')
-
- # create a new statement
- df = ctx.table('example').select(
- col("a") + col("b"),
- col("a") - col("b"),
- )
-
- # execute and collect the first (and only) batch
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([5, 7, 9])
- assert result.column(1) == pyarrow.array([-3, -3, -3])
-
-
-And how to execute a query against a CSV using SQL:
-
-
-.. code-block:: python
-
- import datafusion
- from datafusion import col
- import pyarrow
-
- # create a context
- ctx = datafusion.SessionContext()
-
- # register a CSV
- ctx.register_csv('example', 'example.csv')
-
- # create a new statement via SQL
- df = ctx.sql("SELECT a+b, a-b FROM example")
-
- # execute and collect the first (and only) batch
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([5, 7, 9])
- assert result.column(1) == pyarrow.array([-3, -3, -3])
-
-
-
-UDFs
-----
-
-.. code-block:: python
-
- import pyarrow
- from datafusion import udf
-
- def is_null(array: pyarrow.Array) -> pyarrow.Array:
- return array.is_null()
-
- is_null_arr = udf(is_null, [pyarrow.int64()], pyarrow.bool_(), 'stable')
-
- # create a context
- ctx = datafusion.SessionContext()
-
- # create a RecordBatch and a new DataFrame from it
- batch = pyarrow.RecordBatch.from_arrays(
- [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
- names=["a", "b"],
- )
- df = ctx.create_dataframe([[batch]])
-
- df = df.select(is_null_arr(col("a")))
-
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([False] * 3)
-
+ df
-UDAF
-----
-
-.. code-block:: python
-
- import pyarrow
- import pyarrow.compute
- import datafusion
- from datafusion import udaf, Accumulator
- from datafusion import col
-
-
- class MyAccumulator(Accumulator):
- """
- Interface of a user-defined accumulation.
- """
- def __init__(self):
- self._sum = pyarrow.scalar(0.0)
-
- def update(self, values: pyarrow.Array) -> None:
- # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
- self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(values).as_py())
-
- def merge(self, states: pyarrow.Array) -> None:
- # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
- self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(states).as_py())
-
- def state(self) -> pyarrow.Array:
- return pyarrow.array([self._sum.as_py()])
-
- def evaluate(self) -> pyarrow.Scalar:
- return self._sum
-
- # create a context
- ctx = datafusion.SessionContext()
-
- # create a RecordBatch and a new DataFrame from it
- batch = pyarrow.RecordBatch.from_arrays(
- [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
- names=["a", "b"],
- )
- df = ctx.create_dataframe([[batch]])
-
- my_udaf = udaf(MyAccumulator, pyarrow.float64(), pyarrow.float64(), [pyarrow.float64()], 'stable')
-
- df = df.aggregate(
- [],
- [my_udaf(col("a"))]
- )
-
- result = df.collect()[0]
-
- assert result.column(0) == pyarrow.array([6.0])
-
-How to install (from pip)
-=========================
-
-.. code-block:: shell
-
- pip install datafusion
-
-You can verify the installation by running:
-
-.. code-block:: python
-
- >>> import datafusion
- >>> datafusion.__version__
- '0.6.0'
-
-
-How to develop
-==============
-
-This assumes that you have rust and cargo installed. We use the workflow recommended by `pyo3 <https://github.com/PyO3/pyo3>`_ and `maturin <https://github.com/PyO3/maturin>`_.
-
-Bootstrap:
-
-.. code-block:: shell
-
- # fetch this repo
- git clone [email protected]:apache/arrow-datafusion-python.git
- # prepare development environment (used to build wheel / install in development)
- python3 -m venv venv
- # activate the venv
- source venv/bin/activate
- # update pip itself if necessary
- python -m pip install -U pip
- # install dependencies (for Python 3.8+)
- python -m pip install -r requirements-310.txt
-
-The tests rely on test data in git submodules.
-
-.. code-block:: shell
-
- git submodule init
- git submodule update
-
-
-Whenever rust code changes (your changes or via `git pull`):
-
-.. code-block:: shell
-
- # make sure you activate the venv using "source venv/bin/activate" first
- maturin develop
- python -m pytest
-
-
-How to update dependencies
-==========================
-
-To change test dependencies, change the `requirements.in` and run
-
-.. code-block:: shell
-
- # install pip-tools (this can be done only once), also consider running in venv
- python -m pip install pip-tools
- python -m piptools compile --generate-hashes -o requirements-310.txt
+.. _toc.links:
+.. toctree::
+ :hidden:
+ :maxdepth: 1
+ :caption: LINKS
-To update dependencies, run with `-U`
+ GitHub and Issue Tracker <https://github.com/apache/arrow-datafusion-python>
+ Rust's API Docs <https://docs.rs/datafusion/latest/datafusion/>
+ Code of Conduct <https://github.com/apache/arrow-datafusion/blob/main/CODE_OF_CONDUCT.md>
-.. code-block:: shell
-
- python -m piptools compile -U --generate-hashes -o requirements-310.txt
+.. _toc.guide:
+.. toctree::
+ :hidden:
+ :maxdepth: 1
+ :caption: USER GUIDE
+ user-guide/introduction
+ user-guide/basics
+ user-guide/common-operations/index
+ user-guide/io/index
+ user-guide/sql
-More details about pip-tools `here <https://github.com/jazzband/pip-tools>`_
+.. _toc.contributor_guide:
+.. toctree::
+ :hidden:
+ :maxdepth: 1
+ :caption: CONTRIBUTOR GUIDE
-API reference
-=============
+ contributor-guide/introduction
+.. _toc.api:
.. toctree::
- :maxdepth: 2
+ :hidden:
+ :maxdepth: 1
+ :caption: API
api
diff --git a/docs/source/user-guide/basics.rst b/docs/source/user-guide/basics.rst
new file mode 100644
index 0000000..438b231
--- /dev/null
+++ b/docs/source/user-guide/basics.rst
@@ -0,0 +1,89 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Concepts
+========
+
+In this section, we will cover a basic example to introduce a few key concepts.
+
+.. code-block:: python
+
+ import datafusion
+ from datafusion import col
+ import pyarrow
+
+ # create a context
+ ctx = datafusion.SessionContext()
+
+ # create a RecordBatch and a new DataFrame from it
+ batch = pyarrow.RecordBatch.from_arrays(
+ [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
+ names=["a", "b"],
+ )
+ df = ctx.create_dataframe([[batch]])
+
+ # create a new statement
+ df = df.select(
+ col("a") + col("b"),
+ col("a") - col("b"),
+ )
+
+ # execute and collect the first (and only) batch
+ result = df.collect()[0]
+
+The first statement group:
+
+.. code-block:: python
+
+ # create a context
+ ctx = datafusion.SessionContext()
+
+creates a :code:`SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state
+of the connection between a user and an instance of the DataFusion engine. Additionally, it provides the following functionality:
+
+- Create a DataFrame from a CSV or Parquet data source.
+- Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
+- Register a custom data source that can be referenced from a SQL query.
+- Execute a SQL query.
+
+The second statement group creates a :code:`DataFrame`,
+
+.. code-block:: python
+
+ # create a RecordBatch and a new DataFrame from it
+ batch = pyarrow.RecordBatch.from_arrays(
+ [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
+ names=["a", "b"],
+ )
+ df = ctx.create_dataframe([[batch]])
+
+A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_.
+DataFrames are typically created by calling a method on :code:`SessionContext`, such as :code:`read_csv`, and can then be modified by
+calling the transformation methods, such as :meth:`.DataFrame.filter`, :meth:`.DataFrame.select`, :meth:`.DataFrame.aggregate`,
+and :meth:`.DataFrame.limit` to build up a query definition.
+
+The third statement uses :code:`Expressions` to build up a query definition.
+
+.. code-block:: python
+
+ df = df.select(
+ col("a") + col("b"),
+ col("a") - col("b"),
+ )
+
+Finally, the :code:`collect` method converts the logical plan represented by the DataFrame into a physical plan and executes it,
+collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.
\ No newline at end of file
diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst
new file mode 100644
index 0000000..235d644
--- /dev/null
+++ b/docs/source/user-guide/common-operations/aggregations.rst
@@ -0,0 +1,60 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Aggregation
+============
+
+An aggregate or aggregation is a function where the values of multiple rows are processed together to form a single summary value.
+To perform an aggregation, DataFusion provides the :meth:`.DataFrame.aggregate` method
+
+.. ipython:: python
+
+ from datafusion import SessionContext
+ from datafusion import column, lit
+ from datafusion import functions as f
+ import random
+
+ ctx = SessionContext()
+ df = ctx.from_pydict(
+ {
+ "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
+ "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
+ "c": [random.randint(0, 100) for _ in range(8)],
+ "d": [random.random() for _ in range(8)],
+ },
+ name="foo_bar"
+ )
+
+ col_a = column("a")
+ col_b = column("b")
+ col_c = column("c")
+ col_d = column("d")
+
+ df.aggregate([], [f.approx_distinct(col_c), f.approx_median(col_d), f.approx_percentile_cont(col_d, lit(0.5))])
+
+When the :code:`group_by` list is empty the aggregation is done over the whole :class:`.DataFrame`. For grouping,
+the :code:`group_by` list must contain at least one column.
+
+.. ipython:: python
+
+ df.aggregate([col_a], [f.sum(col_c), f.max(col_d), f.min(col_d)])
+
+More than one column can be used for grouping
+
+.. ipython:: python
+
+ df.aggregate([col_a, col_b], [f.sum(col_c), f.max(col_d), f.min(col_d)])
diff --git a/docs/source/user-guide/common-operations/basic-info.rst b/docs/source/user-guide/common-operations/basic-info.rst
new file mode 100644
index 0000000..424e1cc
--- /dev/null
+++ b/docs/source/user-guide/common-operations/basic-info.rst
@@ -0,0 +1,61 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Basic Operations
+================
+
+In this section, you will learn how to display essential details of DataFrames using specific functions.
+
+.. ipython:: python
+
+ from datafusion import SessionContext
+ import random
+
+ ctx = SessionContext()
+ df = ctx.from_pydict({
+ "nrs": [1, 2, 3, 4, 5],
+ "names": ["python", "ruby", "java", "haskell", "go"],
+ "random": random.sample(range(1000), 5),
+ "groups": ["A", "A", "B", "C", "B"],
+ })
+ df
+
+Use :meth:`.DataFrame.limit` to view the top rows of the frame:
+
+.. ipython:: python
+
+ df.limit(2)
+
+Display the columns of the DataFrame using :meth:`.DataFrame.schema`:
+
+.. ipython:: python
+
+ df.schema()
+
+The method :meth:`.DataFrame.to_pandas` uses pyarrow to convert to a pandas DataFrame: it collects the batches,
+combines them into an Arrow table, and then converts the table to a pandas DataFrame.
+
+.. ipython:: python
+
+ df.to_pandas()
+
+:meth:`.DataFrame.describe` shows a quick statistic summary of your data:
+
+.. ipython:: python
+
+ df.describe()
+
diff --git a/docs/source/user-guide/common-operations/functions.rst b/docs/source/user-guide/common-operations/functions.rst
new file mode 100644
index 0000000..7e5c592
--- /dev/null
+++ b/docs/source/user-guide/common-operations/functions.rst
@@ -0,0 +1,117 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Functions
+=========
+
+DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions.
+Here we will cover some of the more popular use cases. If you want to view all the functions, go to the :ref:`Functions` API Reference.
+
+We'll use the pokemon dataset in the following examples.
+
+.. ipython:: python
+
+ import urllib.request
+ from datafusion import SessionContext
+
+ urllib.request.urlretrieve(
+ "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
+ "pokemon.csv",
+ )
+
+ ctx = SessionContext()
+ ctx.register_csv("pokemon", "pokemon.csv")
+ df = ctx.table("pokemon")
+
+Mathematical
+------------
+
+DataFusion offers mathematical functions such as :func:`.pow` or :func:`.log`
+
+.. ipython:: python
+
+ from datafusion import col, literal
+ from datafusion import functions as f
+
+ df.select(
+ f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
+ ).limit(10)
+
+
+Conditional
+-----------
+
+There are 3 conditional functions in DataFusion: :func:`.coalesce`, :func:`.nullif`, and :func:`.case` (not available in Python)
+
+.. ipython:: python
+
+ df.select(
+ f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
+ ).limit(10)
+
+Temporal
+--------
+
+For selecting the current time use :func:`.now`
+
+.. ipython:: python
+
+ df.select(f.now())
+
+Convert to timestamps using :func:`.to_timestamp`
+
+.. ipython:: python
+
+ df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
+
+String
+------
+
+In the field of data science, working with textual data is a common task. To make string manipulation easier,
+DataFusion offers a range of helpful options.
+
+.. ipython:: python
+
+ df.select(
+ f.char_length(col('"Name"')).alias("len"),
+ f.lower(col('"Name"')).alias("lower"),
+ f.left(col('"Name"'), literal(4)).alias("code")
+ )
+
+This also includes the functions for regular expressions :func:`.regexp_replace` and :func:`.regexp_match`
+
+.. ipython:: python
+
+ df.select(
+ f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
+ f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
+ )
+
+
+Other
+-----
+
+The function :func:`.in_list` allows you to check a column for the presence of multiple values:
+
+.. ipython:: python
+
+ types = [literal("Grass"), literal("Fire"), literal("Water")]
+ (
+ df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
+ .limit(20)
+ .to_pandas()
+ )
diff --git a/docs/source/api.rst b/docs/source/user-guide/common-operations/index.rst
similarity index 83%
copy from docs/source/api.rst
copy to docs/source/user-guide/common-operations/index.rst
index a5d6543..950afb9 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/common-operations/index.rst
@@ -15,18 +15,16 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
-
-*************
-API Reference
-*************
+Common Operations
+=================
.. toctree::
:maxdepth: 2
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+ basic-info
+ select-and-filter
+ joins
+ functions
+ aggregations
+ windows
+ udf-and-udfa
diff --git a/docs/source/user-guide/common-operations/joins.rst b/docs/source/user-guide/common-operations/joins.rst
new file mode 100644
index 0000000..1282031
--- /dev/null
+++ b/docs/source/user-guide/common-operations/joins.rst
@@ -0,0 +1,104 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Joins
+=====
+
+DataFusion supports the following join variants via the method :meth:`.DataFrame.join`
+
+- Inner Join
+- Left Join
+- Right Join
+- Full Join
+- Left Semi Join
+- Left Anti Join
+
+For the examples in this section we'll use the following two DataFrames
+
+.. ipython:: python
+
+ from datafusion import SessionContext
+
+ ctx = SessionContext()
+
+ left = ctx.from_pydict(
+ {
+ "customer_id": [1, 2, 3],
+ "customer": ["Alice", "Bob", "Charlie"],
+ }
+ )
+
+ right = ctx.from_pylist([
+ {"id": 1, "name": "CityCabs"},
+ {"id": 2, "name": "MetroRide"},
+ {"id": 5, "name": "UrbanGo"},
+ ])
+
+Inner Join
+----------
+
+When using an inner join, only rows with common values in the two join columns, i.e. values present in both DataFrames,
+will be included in the resulting DataFrame.
+
+.. ipython:: python
+
+ left.join(right, join_keys=(["customer_id"], ["id"]), how="inner")
+
+The parameter :code:`join_keys` specifies the columns from the left and right DataFrames that contain the values
+that should match.
+
+Left Join
+---------
+
+A left join combines rows from two DataFrames using the key columns. It returns all rows from the left DataFrame and
+matching rows from the right DataFrame. If there's no match in the right DataFrame, it returns null
+values for the corresponding columns.
+
+.. ipython:: python
+
+ left.join(right, join_keys=(["customer_id"], ["id"]), how="left")
+
+Full Join
+---------
+
+A full join merges rows from two tables based on a related column, returning all rows from both tables, even if there
+is no match. Unmatched rows will have null values.
+
+.. ipython:: python
+
+ left.join(right, join_keys=(["customer_id"], ["id"]), how="full")
+
+Left Semi Join
+--------------
+
+A left semi join retrieves matching rows from the left table while
+omitting duplicates with multiple matches in the right table.
+
+.. ipython:: python
+
+ left.join(right, join_keys=(["customer_id"], ["id"]), how="semi")
+
+Left Anti Join
+--------------
+
+A left anti join shows all rows from the left table without any matching rows in the right table,
+based on the specified matching columns. It excludes rows from the left table that have at least one matching row in
+the right table.
+
+.. ipython:: python
+
+ left.join(right, join_keys=(["customer_id"], ["id"]), how="anti")
\ No newline at end of file
diff --git a/docs/source/user-guide/common-operations/select-and-filter.rst b/docs/source/user-guide/common-operations/select-and-filter.rst
new file mode 100644
index 0000000..8ede230
--- /dev/null
+++ b/docs/source/user-guide/common-operations/select-and-filter.rst
@@ -0,0 +1,67 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Column Selections
+=================
+
+Use :meth:`.DataFrame.select_columns` for basic column selection.
+
+DataFusion can work with several file types. To start simple, we can use a subset of the
+`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_
+
+.. ipython:: python
+
+ import urllib.request
+ from datafusion import SessionContext
+
+ urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
+ "yellow_trip_data.parquet")
+
+ ctx = SessionContext()
+ df = ctx.read_parquet("yellow_trip_data.parquet")
+ df.select_columns("trip_distance", "passenger_count")
+
+For mathematical or logical operations, use :func:`.col` to select columns, and give meaningful names to the resulting
+operations using :func:`.alias`
+
+
+.. ipython:: python
+
+ from datafusion import col, lit
+ df.select((col("tip_amount") + col("tolls_amount")).alias("tips_plus_tolls"))
+
+.. warning::
+
+ Please be aware that all identifiers are effectively made lower-case in SQL, so if your file has capital letters
+ (ex: Name) you must put your column name in double quotes or the selection won’t work. As an alternative, for simple
+ column selection use :meth:`.DataFrame.select_columns` without double quotes.
+
+For selecting columns with capital letters use ``'"VendorID"'``
+
+.. ipython:: python
+
+ df.select(col('"VendorID"'))
+
+
+To combine it with literal values, use :func:`.lit`
+
+.. ipython:: python
+
+ large_trip_distance = col("trip_distance") > lit(5.0)
+ low_passenger_count = col("passenger_count") < lit(4)
+ df.select((large_trip_distance & low_passenger_count).alias("lonely_trips"))
+
diff --git a/docs/source/user-guide/common-operations/udf-and-udfa.rst b/docs/source/user-guide/common-operations/udf-and-udfa.rst
new file mode 100644
index 0000000..62d249c
--- /dev/null
+++ b/docs/source/user-guide/common-operations/udf-and-udfa.rst
@@ -0,0 +1,85 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+User Defined Functions
+======================
+
+DataFusion provides powerful expressions and functions, reducing the need for custom Python functions.
+However, you can still incorporate your own functions, i.e. User-Defined Functions (UDFs), with the :func:`.udf` function.
+
+.. ipython:: python
+
+ import pyarrow
+ import datafusion
+ from datafusion import udf, col
+
+ def is_null(array: pyarrow.Array) -> pyarrow.Array:
+ return array.is_null()
+
+ is_null_arr = udf(is_null, [pyarrow.int64()], pyarrow.bool_(), 'stable')
+
+ ctx = datafusion.SessionContext()
+
+ batch = pyarrow.RecordBatch.from_arrays(
+ [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
+ names=["a", "b"],
+ )
+ df = ctx.create_dataframe([[batch]], name="batch_array")
+
+ df.select(is_null_arr(col("a"))).to_pandas()
+
+Additionally, the :func:`.udaf` function allows you to define User-Defined Aggregate Functions (UDAFs)
+
+.. code-block:: python
+
+ import pyarrow
+ import pyarrow.compute
+ import datafusion
+ from datafusion import col, udaf, Accumulator
+
+    class MyAccumulator(Accumulator):
+        """
+        Interface of a user-defined accumulation.
+        """
+        def __init__(self):
+            self._sum = pyarrow.scalar(0.0)
+
+        def update(self, values: pyarrow.Array) -> None:
+            # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
+            self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(values).as_py())
+
+        def merge(self, states: pyarrow.Array) -> None:
+            # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
+            self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(states).as_py())
+
+        def state(self) -> pyarrow.Array:
+            return pyarrow.array([self._sum.as_py()])
+
+        def evaluate(self) -> pyarrow.Scalar:
+            return self._sum
+
+ ctx = datafusion.SessionContext()
+ df = ctx.from_pydict(
+ {
+ "a": [1, 2, 3],
+ "b": [4, 5, 6],
+ }
+ )
+
+    my_udaf = udaf(MyAccumulator, pyarrow.float64(), pyarrow.float64(), [pyarrow.float64()], 'stable')
+
+    df.aggregate([], [my_udaf(col("a"))])
diff --git a/docs/source/user-guide/common-operations/windows.rst
b/docs/source/user-guide/common-operations/windows.rst
new file mode 100644
index 0000000..f884c7e
--- /dev/null
+++ b/docs/source/user-guide/common-operations/windows.rst
@@ -0,0 +1,93 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Window Functions
+================
+
+In this section you will learn about window functions. A window function utilizes values from one or multiple rows to
+produce a result for each individual row, unlike an aggregate function that provides a single value for multiple rows.
+
+The functionality of window functions in DataFusion is supported by the dedicated :func:`.window` function.
+
+We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
+
+.. ipython:: python
+
+ import urllib.request
+ from datafusion import SessionContext
+ from datafusion import col
+ from datafusion import functions as f
+
+ urllib.request.urlretrieve(
+        "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
+ "pokemon.csv",
+ )
+
+ ctx = SessionContext()
+ df = ctx.read_csv("pokemon.csv")
+
+Here is an example that shows how to compare each Pokemon's attack power with the average attack power in its :code:`"Type 1"`
+
+.. ipython:: python
+
+ df.select(
+ col('"Name"'),
+ col('"Attack"'),
+ f.alias(
+ f.window("avg", [col('"Attack"')], partition_by=[col('"Type 1"')]),
+ "Average Attack",
+ )
+ )
+
+You can also control the order in which rows are processed by window functions by providing
+a list of :func:`.order_by` functions for the :code:`order_by` parameter.
+
+.. ipython:: python
+
+ df.select(
+ col('"Name"'),
+ col('"Attack"'),
+ f.alias(
+ f.window(
+ "rank",
+ [],
+ partition_by=[col('"Type 1"')],
+ order_by=[f.order_by(col('"Attack"'))],
+ ),
+ "rank",
+ ),
+ )
+
+The possible window functions are:
+
+1. Rank Functions
+ - rank
+ - dense_rank
+ - row_number
+ - ntile
+
+2. Analytical Functions
+ - cume_dist
+ - percent_rank
+ - lag
+ - lead
+ - first_value
+ - last_value
+ - nth_value
+
+3. Aggregate Functions
+ - All aggregate functions can be used as window functions.
diff --git a/docs/source/api.rst b/docs/source/user-guide/introduction.rst
similarity index 58%
copy from docs/source/api.rst
copy to docs/source/user-guide/introduction.rst
index a5d6543..8abb911 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/introduction.rst
@@ -15,18 +15,29 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+.. _guide:
+
+Introduction
+============
+
+Welcome to the User Guide for the Python bindings of Arrow DataFusion. This guide aims to provide an introduction to
+DataFusion through various examples and highlight the most effective ways of using it.
+
+Installation
+------------
+
+DataFusion is a Python library and, as such, can be installed via pip from `PyPI <https://pypi.org/project/datafusion>`__.
+
+.. code-block:: shell
+
+ pip install datafusion
+
+You can verify the installation by running:
+
+.. ipython:: python
+
+ import datafusion
+ datafusion.__version__
-*************
-API Reference
-*************
-.. toctree::
- :maxdepth: 2
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
diff --git a/docs/source/api.rst b/docs/source/user-guide/io/avro.rst
similarity index 72%
copy from docs/source/api.rst
copy to docs/source/user-guide/io/avro.rst
index a5d6543..85d546e 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/io/avro.rst
@@ -15,18 +15,16 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+Avro
+====
-*************
-API Reference
-*************
+`Avro <https://avro.apache.org/>`_ is a serialization format for record data. Reading an Avro file is very straightforward
+with :meth:`.SessionContext.read_avro`
-.. toctree::
- :maxdepth: 2
+.. code-block:: python
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+
+ from datafusion import SessionContext
+
+ ctx = SessionContext()
+ df = ctx.read_avro("file.avro")
\ No newline at end of file
diff --git a/docs/source/api.rst b/docs/source/user-guide/io/csv.rst
similarity index 68%
copy from docs/source/api.rst
copy to docs/source/user-guide/io/csv.rst
index a5d6543..3f95c54 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/io/csv.rst
@@ -15,18 +15,22 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+CSV
+===
-*************
-API Reference
-*************
+Reading a CSV file is very straightforward with :meth:`.SessionContext.read_csv`
-.. toctree::
- :maxdepth: 2
+.. code-block:: python
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+
+ from datafusion import SessionContext
+
+ ctx = SessionContext()
+ df = ctx.read_csv("file.csv")
+
+An alternative is to use :meth:`.SessionContext.register_csv`
+
+.. code-block:: python
+
+ ctx.register_csv("file", "file.csv")
+ df = ctx.table("file")
\ No newline at end of file
diff --git a/docs/source/api.rst b/docs/source/user-guide/io/index.rst
similarity index 83%
copy from docs/source/api.rst
copy to docs/source/user-guide/io/index.rst
index a5d6543..af08240 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/io/index.rst
@@ -15,18 +15,14 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
-
-*************
-API Reference
-*************
+IO
+==
.. toctree::
:maxdepth: 2
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+ csv
+ parquet
+ json
+ avro
+
diff --git a/docs/source/api.rst b/docs/source/user-guide/io/json.rst
similarity index 69%
copy from docs/source/api.rst
copy to docs/source/user-guide/io/json.rst
index a5d6543..5949a03 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/io/json.rst
@@ -15,18 +15,15 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+JSON
+====
+`JSON <https://www.json.org/json-en.html>`_ (JavaScript Object Notation) is a lightweight data-interchange format.
+Reading a JSON file is simple with :meth:`.SessionContext.read_json`
-*************
-API Reference
-*************
+.. code-block:: python
-.. toctree::
- :maxdepth: 2
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+ from datafusion import SessionContext
+
+ ctx = SessionContext()
+    df = ctx.read_json("file.json")
\ No newline at end of file
diff --git a/docs/source/api.rst b/docs/source/user-guide/io/parquet.rst
similarity index 65%
copy from docs/source/api.rst
copy to docs/source/user-guide/io/parquet.rst
index a5d6543..78bba30 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/io/parquet.rst
@@ -15,18 +15,22 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+Parquet
+=======
-*************
-API Reference
-*************
+It is quite simple to read a Parquet file using the :meth:`.SessionContext.read_parquet` function.
-.. toctree::
- :maxdepth: 2
+.. code-block:: python
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+
+ from datafusion import SessionContext
+
+ ctx = SessionContext()
+ df = ctx.read_parquet("file.parquet")
+
+An alternative is to use :meth:`.SessionContext.register_parquet`
+
+.. code-block:: python
+
+ ctx.register_parquet("file", "file.parquet")
+ df = ctx.table("file")
\ No newline at end of file
diff --git a/docs/source/api.rst b/docs/source/user-guide/sql.rst
similarity index 59%
copy from docs/source/api.rst
copy to docs/source/user-guide/sql.rst
index a5d6543..6fa7f0c 100644
--- a/docs/source/api.rst
+++ b/docs/source/user-guide/sql.rst
@@ -15,18 +15,25 @@
.. specific language governing permissions and limitations
.. under the License.
-.. _api:
+SQL
+===
-*************
-API Reference
-*************
+DataFusion also offers a SQL API; the full reference is available in the `DataFusion SQL documentation <https://arrow.apache.org/datafusion/user-guide/sql/index.html>`_
-.. toctree::
- :maxdepth: 2
+.. ipython:: python
- api/config
- api/dataframe
- api/execution_context
- api/expression
- api/functions
- api/object_store
+ import datafusion
+ from datafusion import col
+ import pyarrow
+
+ # create a context
+ ctx = datafusion.SessionContext()
+
+ # register a CSV
+ ctx.register_csv('pokemon', 'pokemon.csv')
+
+ # create a new statement via SQL
+ df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')
+
+ # collect and convert to pandas DataFrame
+ df.to_pandas()
\ No newline at end of file