[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

via GitHub Fri, 12 May 2023 12:53:48 -0700


westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192746315



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each 
Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending 
on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the 
DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished 
executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of 
:class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly 
APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation 
can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those 
``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other 
Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and 
planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much 
easier time finding an existing
+Susbstrait producer.  For example, you could use `ibis-substrait 
<https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from python expressions.  There are a few 
different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait 
producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be 
found in
+:ref:`the Substrait page<acero-substrait>`
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan 
from Substrait, though loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a 
Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an 
options object.  For example:
+
+.. literalinclude:: 
../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project 
declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are 
creating a linear sequence of
+declarations (like in the above example) then you can also use the 
:func:`Declaration::Sequence` function.
+
+.. literalinclude:: 
../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this 
document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a 
declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with 
``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into 
a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main 
disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more 
detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with 
DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large 
table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows 
each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 
<https://github.com/apache/arrow/issues/15155>`_.
+
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the 
results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your liesure.  If 
you do not read from the
+reader quickly enough then backpressure will be applied and the execution plan 
will pause.  Closing the
+reader will cancel the running execution plan and the reader's destructor will 
wait for the execution plan
+to finish whatever it is doing and so it may block.
+
+DeclarationToStatus
+-------------------
+
+The :func:`DeclarationToStatus` method is useful if you want to run the plan 
but do not actually want to
+consume the results.  For example, this is useful when benchmarking or when 
the plan has side effects such
+as a dataset write node.  If the plan generates any results then they will be 
immediately discarded.
+
+Running a Plan Directly
+-----------------------
+
+If one of the ``DeclarationToXyz`` methods is not sufficient for some reason 
then it is possible to run a plan
+directly.  This should only be needed if you are doing something unique.  For 
example, if you have created a
+custom sink node or if you need a plan that has multiple outputs.
+
+.. note::
+   In academic literature and many existing systems there is a general 
assumption that an execution plan has
+   at most one output.  There are some things in Acero, such as the 
DeclarationToXyz methods, which will expect
+   this.  However, there is nothing in the design that strictly prevents 
having multiple sink nodes.
+
+Detailed instructions on how to do this are out of scope for this guide but 
the rough steps are:
+
+1. Create a new :class:`ExecPlan` object.
+2. Add sink nodes to your graph of :class:`Declaration` objects (this is the 
only type you will need
+   to create declarations for sink nodes)
+3. Use :func:`Declaration::AddToPlan` to add your declaration to your plan (if 
you have more than one output
+   then you will not be able to use this method and will need to add your 
nodes one at a time)
+4. Validate the plan with :func:`ExecPlan::Validate`
+5. Start the plan with :func:`ExecPlan::StartProducing`
+6. Wait for the future returned by :func:`ExecPlan::finished` to complete.
+
+Providing Input
+===============
+
+Input data for an exec plan can come from a variety of sources.  It is often 
read from files stored on some
+kind of filesystem.  It is also common for input to come from in-memory data.  
In-memory data is typical, for
+example, in a pandas-like frontend.  Input could also come from network 
streams like a Flight request.  Acero
+can support all of these cases and can even support unique and custom 
situations not mentioned here.
+
+There are pre-defined source nodes that cover the most common input scenarios. 
 These are listed below.  However,
+if your source data is unique then you will need to use the generic ``source`` 
node.  This node expects you to
+provide an asycnhronous stream of batches and is covered in more detail 
:ref:`here <stream_execution_source_docs>`.
+
+.. _ExecNode List:
+
+Available ``ExecNode`` Implementations
+======================================
+
+The following tables quickly summarize the available operators.
+
+Sources
+-------
+
+These nodes can be used as sources of data
+
+.. list-table:: Source Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``source``
+     - :class:`SourceNodeOptions`
+     - A generic source node that wraps an asynchronous stream of data 
(:ref:`example <stream_execution_source_docs>`)
+   * - ``table_source``
+     - :class:`TableSourceNodeOptions`
+     - Generates data from an :class:`arrow::Table` (:ref:`example 
<stream_execution_table_source_docs>`)
+   * - ``record_batch_source``
+     - :class:`RecordBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::RecordBatch`
+   * - ``record_batch_reader_source``
+     - :class:`RecordBatchReaderSourceNodeOptions`
+     - Generates data from an :class:`arrow::RecordBatchReader`
+   * - ``exec_batch_source``
+     - :class:`ExecBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::compute::ExecBatch`
+   * - ``array_vector_source``
+     - :class:`ArrayVectorSourceNodeOptions`
+     - Generates data from an iterator of vectors of :class:`arrow::Array`
+   * - ``scan``
+     - :class:`arrow::dataset::ScanNodeOptions`
+     - Generates data from an `arrow::dataset::Dataset` (requires the datasets 
module)
+       (:ref:`example <stream_execution_scan_docs>`)
+
+Compute Nodes
+-------------
+
+These nodes perform computations on data and may transform or reshape the data
+
+.. list-table:: Compute Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``filter``
+     - :class:`FilterNodeOptions`
+     - Removes rows that do not match a given filter expression
+   * - ``project``
+     - :class:`ProjectNodeOptions`
+     - Creates new columns by evaluating compute expressions.  Can also drop 
and reorder columns
+       (:ref:`example <stream_execution_project_docs>`)
+   * - ``aggregate``
+     - :class:`AggregateNodeOptions`
+     - Calculates summary statistics across the entire input stream or on 
groups of data
+       (:ref:`example <stream_execution_aggregate_docs>`)
+   * - ``pivot_longer``
+     - :class:`PivotLongerNodeOptions`
+     - Reshapes data by converting some columns into additional rows
+
+Arrangement Nodes
+-----------------
+
+These nodes reorder, combine, or slice streams of data
+
+.. list-table:: Arrangement Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``hash_join``
+     - :class:`HashJoinNodeOptions`
+     - Joins two inputs based on common columns (:ref:`example 
<stream_execution_hashjoin_docs>`)

Review Comment:
   I added asofjoin to the table.  Regrettably we don't have an example yet but 
I'll defer that for follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Reply via email to