westonpace commented on a change in pull request #12555:
URL: https://github.com/apache/arrow/pull/12555#discussion_r819129303



##########
File path: cpp/examples/arrow/execution_plan_documentation_examples.cc
##########
@@ -353,6 +363,38 @@ arrow::Status SourceSinkExample(cp::ExecContext& exec_context) {
 }
 // (Doc section: Source Example)
 
+// (Doc section: Table Source Example)
+/**
+ * \brief
+ * TableSource-Sink Example
+ * This example shows how a tabl_source and sink can be used
+ * in an execution plan. This includes table source node
+ * receiving data as a table and the sink node emits
+ * the data as an output represented in a table.

Review comment:
   ```suggestion
    * This example shows how a table_source and sink can be used
    * in an execution plan. This includes a table source node
    * receiving data from a table and a sink node emitting
    * the data to a generator, which we collect into a table.
   ```
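
   For context, here is a minimal sketch of the kind of plan this example builds: a
   ``table_source`` feeding a ``sink`` whose generator output is collected back into a
   table. This assumes the ``TableSourceNodeOptions`` API proposed in this PR; ``GetTable``
   is a hypothetical helper, and the collection code may differ from the final example file.

   ```cpp
   #include <iostream>
   #include <memory>
   #include <utility>

   #include <arrow/api.h>
   #include <arrow/compute/api.h>
   #include <arrow/compute/exec/exec_plan.h>
   #include <arrow/compute/exec/options.h>
   #include <arrow/util/async_generator.h>
   #include <arrow/util/optional.h>

   namespace cp = arrow::compute;

   // Hypothetical helper: build a small single-column table to feed the plan.
   arrow::Result<std::shared_ptr<arrow::Table>> GetTable() {
     arrow::Int64Builder builder;
     ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> a, builder.Finish());
     return arrow::Table::Make(arrow::schema({arrow::field("a", arrow::int64())}), {a});
   }

   arrow::Status TableSourceSinkExample(cp::ExecContext& exec_context) {
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<cp::ExecPlan> plan,
                           cp::ExecPlan::Make(&exec_context));
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());

     // The table source node reads the table in max_batch_size chunks.
     cp::TableSourceNodeOptions table_source_options{table, /*max_batch_size=*/2};
     ARROW_ASSIGN_OR_RAISE(
         cp::ExecNode* source,
         cp::MakeExecNode("table_source", plan.get(), {}, table_source_options));

     // The sink node emits the data as an async generator of ExecBatch.
     arrow::AsyncGenerator<arrow::util::optional<cp::ExecBatch>> sink_gen;
     ARROW_RETURN_NOT_OK(
         cp::MakeExecNode("sink", plan.get(), {source}, cp::SinkNodeOptions{&sink_gen}));

     // Run the plan and collect the generator's output back into a table.
     ARROW_RETURN_NOT_OK(plan->StartProducing());
     std::shared_ptr<arrow::RecordBatchReader> reader = cp::MakeGeneratorReader(
         table->schema(), std::move(sink_gen), exec_context.memory_pool());
     ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> collected,
                           arrow::Table::FromRecordBatchReader(reader.get()));
     std::cout << "collected " << collected->num_rows() << " rows" << std::endl;
     plan->StopProducing();
     return plan->finished().status();
   }
   ```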

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -396,6 +397,31 @@ Example of using ``source`` (usage of sink is explained in detail in :ref:`sink<
   :linenos:
   :lineno-match:
 
+``table_source``
+----------------
+
+.. _stream_execution_table_source_docs:
+
+In the previous example, :ref:`source node <stream_execution_source_docs>`, a source node
+was used to input the data and the node expected data to be fed using a generator. But in
+developing application much easier and to enable performance optimization, the
+:class:`arrow::compute::TableSourceNodeOptions` can be used. Here the input data can be
+passed as a ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``. The objetive
+of the ``max_batch_size`` is to enable the user to tune the performance given the nature of
+vivid workloads. Another important fact is that the streaming execution engine will use
+the ``max_batch_size`` as execution batch size. And it is important to note that the output
+batches from an execution plan doesn't get merged to form larger batches when an inserted
+table has smaller batch size.

Review comment:
   ```suggestion
   was used to input the data.  But when developing an application, if the data is already
   in memory as a table, it is much easier and more performant to use
   :class:`arrow::compute::TableSourceNodeOptions`.  Here the input data can be passed as a
   ``std::shared_ptr<arrow::Table>`` along with a ``max_batch_size``.  The ``max_batch_size``
   is used to break up large record batches so that they can be processed in parallel.  It is
   important to note that the table batches will not get merged to form larger batches when
   the source table has a smaller batch size.
   ```
   
   At the moment we use the source node's batch size as the execution batch size.  Recently,
   however, there have been talks about maybe using a smaller batch size for execution.  So
   we would slice a large chunk off of the source, create a thread task based on the large
   chunk, and then iterate through that large chunk in smaller chunks.  So for now, let's
   just avoid talking about the execution batch size until that is more finalized.
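
   To make the slicing behavior above concrete, here is a small sketch (assuming the
   ``TableSourceNodeOptions`` constructor shape proposed in this PR; the 1024 value and the
   ``MakeTableSourceOptions`` helper are illustrative, not part of the API):

   ```cpp
   #include <memory>
   #include <utility>

   #include <arrow/compute/exec/options.h>
   #include <arrow/table.h>

   namespace cp = arrow::compute;

   cp::TableSourceNodeOptions MakeTableSourceOptions(std::shared_ptr<arrow::Table> table) {
     // Chunks larger than 1024 rows will be sliced into batches of at most 1024
     // rows so they can be processed in parallel. Chunks that are already smaller
     // than 1024 rows pass through as-is; they are NOT merged into larger batches.
     return cp::TableSourceNodeOptions{std::move(table), /*max_batch_size=*/1024};
   }
   ```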



