vibhatha commented on a change in pull request #12033:
URL: https://github.com/apache/arrow/pull/12033#discussion_r788673710
##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -305,3 +305,497 @@ Datasets may be scanned multiple times; just make
multiple scan
nodes from that dataset. (Useful for a self-join, for example.)
Note that producing two scan nodes like this will perform all
reads and decodes twice.
+
+Constructing ``ExecNode`` using Options
+=======================================
+
+An execution plan can be used to construct various queries.
+Queries are built from a set of building blocks referred to as
+:class:`ExecNode`\ s. These nodes provide operations such as
+filtering, projection, and joins.
+
+This is the list of operations associated with the execution plan:
+
+.. list-table:: Operations and Options
+ :widths: 50 50
+ :header-rows: 1
+
+ * - Operation
+ - Options
+ * - ``source``
+ - :class:`arrow::compute::SourceNodeOptions`
+ * - ``filter``
+ - :class:`arrow::compute::FilterNodeOptions`
+ * - ``project``
+ - :class:`arrow::compute::ProjectNodeOptions`
+ * - ``aggregate``
 - :class:`arrow::compute::AggregateNodeOptions`
+ * - ``sink``
+ - :class:`arrow::compute::SinkNodeOptions`
+ * - ``consuming_sink``
+ - :class:`arrow::compute::ConsumingSinkNodeOptions`
+ * - ``order_by_sink``
+ - :class:`arrow::compute::OrderBySinkNodeOptions`
+ * - ``select_k_sink``
+ - :class:`arrow::compute::SelectKSinkNodeOptions`
+ * - ``scan``
 - :class:`arrow::dataset::ScanNodeOptions`
+ * - ``hash_join``
+ - :class:`arrow::compute::HashJoinNodeOptions`
+ * - ``write``
+ - :class:`arrow::dataset::WriteNodeOptions`
+ * - ``union``
+ - N/A
+
+
+.. _stream_execution_source_docs:
+
+``source``
+----------
+
+A ``source`` operation can be considered as an entry point to create a streaming execution plan.
+:class:`arrow::compute::SourceNodeOptions` are used to create the ``source`` operation. The
+``source`` operation is the most generic and flexible type of source currently available, but it can
+be quite tricky to configure. To process data from files, the ``scan`` operation is likely a simpler choice.
+The source node requires some kind of function that can be called to poll for more data. This
+function should take no arguments and should return an
+``arrow::Future<std::shared_ptr<arrow::util::optional<arrow::RecordBatch>>>``.
+This function might be reading a file, iterating through an in-memory structure, or receiving data
+from a network connection. The Arrow library refers to these functions as
+``arrow::AsyncGenerator`` and there are a number of utilities for working with them. For
+this example we use a vector of record batches that we've already stored in memory.
+In addition, the schema of the data must be known up front. Arrow's streaming execution
+engine must know the schema of the data at each stage of the execution graph before any
+processing has begun. This means we must supply the schema for a source node separately
+from the data itself.
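The pull-based contract described above (call the function repeatedly; each call yields the next batch, until an end-of-stream marker) can be sketched with standard-library stand-ins. This is a conceptual simplification, not Arrow's API: ``std::optional`` stands in for ``arrow::util::optional``, a synchronous ``std::function`` stands in for the ``arrow::Future``-returning ``AsyncGenerator``, and a vector of ints stands in for a record batch.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <vector>

// Stand-in for a record batch; a real generator would yield a
// std::shared_ptr<arrow::RecordBatch> wrapped in an arrow::Future.
using Batch = std::vector<int>;

// Build a pull-style generator over batches already in memory.
// Each call yields the next batch; std::nullopt signals end of stream.
std::function<std::optional<Batch>()> MakeVectorGenerator(
    std::vector<Batch> batches) {
  // Shared state so the returned callable can be copied freely.
  auto state = std::make_shared<std::vector<Batch>>(std::move(batches));
  auto index = std::make_shared<std::size_t>(0);
  return [state, index]() -> std::optional<Batch> {
    if (*index >= state->size()) return std::nullopt;  // exhausted
    return (*state)[(*index)++];
  };
}
```

Once the in-memory batches are wrapped this way, calling the generator repeatedly drains them one at a time, which mirrors how the source node polls its generator.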
+
+Struct to hold the data generator definition:
+
+.. literalinclude:: ../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+ :language: cpp
+ :start-after: (Doc section: BatchesWithSchema Definition)
+ :end-before: (Doc section: BatchesWithSchema Definition)
+ :linenos:
+ :lineno-match:
+
+Generating sample Batches for computation:
+
+.. literalinclude:: ../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+ :language: cpp
+ :start-after: (Doc section: MakeBasicBatches Definition)
+ :end-before: (Doc section: MakeBasicBatches Definition)
+ :linenos:
+ :lineno-match:
+
+Example of using ``source`` (usage of sink is explained in detail in :ref:`sink<stream_execution_sink_docs>`):
+
+.. literalinclude:: ../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+ :language: cpp
+ :start-after: (Doc section: Source Example)
+ :end-before: (Doc section: Source Example)
+ :linenos:
+ :lineno-match:
+
+.. _stream_execution_filter_docs:
+
+``filter``
+----------
+
+The ``filter`` operation, as the name suggests, provides an option to define data filtering
+criteria. It selects rows matching a given expression.
+Filters can be written using :class:`arrow::compute::Expression`.
+For example, if we wish to keep rows where the value of column ``a``
+is greater than 3, then we can use the following expression::
+
+ // a > 3
+ arrow::compute::Expression filter_opt = arrow::compute::greater(
+ arrow::compute::field_ref("a"),
+ arrow::compute::literal(3));
+
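Conceptually, applying this predicate keeps only the rows where ``a > 3``. A plain C++ sketch of that row-wise behavior (using a hypothetical ``Row`` struct as a stand-in for a record batch row, not Arrow's vectorized implementation) looks like:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Stand-in for one row of a batch; "a" mirrors the column
// referenced by the filter expression above.
struct Row {
  int a;
};

// Conceptual equivalent of the filter node's behavior:
// keep only the rows for which the predicate (a > 3) holds.
std::vector<Row> FilterGreaterThan3(const std::vector<Row>& rows) {
  std::vector<Row> out;
  std::copy_if(rows.begin(), rows.end(), std::back_inserter(out),
               [](const Row& r) { return r.a > 3; });
  return out;
}
```

The filter node does the same selection, but it operates on whole batches at a time and evaluates the compiled expression against columns rather than iterating row by row.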
+Using this option, the filter node can be constructed as follows::
+
+ // creating filter node
+ arrow::compute::ExecNode* filter;
+ ARROW_ASSIGN_OR_RAISE(filter, arrow::compute::MakeExecNode("filter",
+ // plan
+ plan.get(),
+ // previous node
+ {scan},
+ // filter node options
+ arrow::compute::FilterNodeOptions{filter_opt}));
Review comment:
I see your point. It is better to remove those comments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]