[GitHub] [arrow] bkietz commented on a change in pull request #11309: ARROW-13227: [Documentation][Compute] Document ExecNode

GitBox Tue, 12 Oct 2021 12:45:22 -0700


bkietz commented on a change in pull request #11309:
URL: https://github.com/apache/arrow/pull/11309#discussion_r727440073




##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -0,0 +1,291 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+==========================
+Streaming execution engine
+==========================
+
+.. warning::
+
+    The streaming execution engine is experimental, and a stable API
+    is not yet guaranteed.
+
+Motivation
+----------
+
+For many complex computations, successive direct `invocation of
+compute functions <invoking compute functions>` is not feasible
+in either memory or computation time. Doing so causes all intermediate
+data to be fully materialized. To facilitate arbitrarily large inputs
+and more efficient resource usage, arrow also provides a streaming query
+engine with which computations can be formulated and executed.
+
+.. image:: simple_graph.svg
+
+:class:`ExecNode` is provided to reify the graph of operations in a query.
+Batches of data (struct:`ExecBatch`) flow along edges of the graph from
+node to node. Structuring the API around a stream of batches allows the
+working set for each node to be tuned for optimal performance independent
+of any other nodes in the graph. Each :class:`ExecNode` processes batches
+as they are pushed to it along an edge of the graph by upstream nodes
+(its inputs), and pushes batches along an edge of the graph to downstream
+nodes (its outputs) as they are finalized.
+
+.. [shaikhha et al] SHAIKHHA, A., DASHTI, M., & KOCH, C. (2018). Push versus 
pull-based loop fusion in query engines. Journal of Functional Programming, 28. 
https://doi.org/10.1017/s0956796818000102
+
+Overview
+--------
+
+:class:`ExecNode`
+  Each node in the graph is an implementation of the :class:`ExecNode` 
interface.
+
+:class:`ExecPlan`
+  A set of :class:`ExecNode` is contained and (to an extent) coordinated by an
+  :class:`ExecPlan`.
+
+:class:`ExecFactoryRegistry`
+  Instances of :class:`ExecNode` are constructed by factory functions held
+  in a :class:`ExecFactoryRegistry`.
+
+:class:`ExecNodeOptions`
+  Heterogenous parameters for factories of :class:`ExecNode` are bundled in an
+  :class:`ExecNodeOptions`.
+
+:struct:`Declaration`
+  ``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`.
+
+:struct:`ExecBatch`
+  A lightweight container for a single chunk of arrow-formatted data. In 
contrast
+  to :class:`RecordBatch`, :struct:`ExecBatch` is intended for use exclusively
+  in a streaming execution context (for example, it will never have a 
corresponding
+  Python binding). Furthermore columns which happen to have a constant value 
may
+  be represented by a :class:`Scalar` instead of an :class:`Array`. In 
addition,
+  :struct:`ExecBatch` may carry execution-relevant properties including a
+  guaranteed-true-filter for :class:`Expression` simplification.
+
+
+An example :class:`ExecNode` implementation which simlpy passes all input 
batches
+through unchanged::
+
+    class PassthruNode : public ExecNode {
+     public:
+      // InputReceived is the main entry point for ExecNodes. It is invoked
+      // by an input of this node to push a batch here for processing.
+      void InputReceived(ExecNode* input, ExecBatch batch) override {
+        // Since this is a passthru node we simply push the batch to our
+        // only output here.
+        outputs_[0]->InputReceived(this, batch);
+      }
+
+      // ErrorRecieved is called by an input of this node to report an error.
+      void ErrorReceived(ExecNode* input, Status error) override {
+        outputs_[0]->ErrorReceived(this, error);
+      }
+
+      // InputFinished is used to signal how many batches will ultimately 
arrive.
+      // It may be called with any ordering relative to 
InputReceived/ErrorReceived.
+      void InputFinished(ExecNode* input, int total_batches) override {
+        outputs_[0]->InputFinished(this, total_batches);
+      }
+
+      // ExecNodes may request that their inputs throttle production of batches
+      // until they are ready for more, or stop production if no further 
batches
+      // are required.
+      void ResumeProducing(ExecNode* output) override { 
inputs_[0]->ResumeProducing(this); }
+      void PauseProducing(ExecNode* output) override { 
inputs_[0]->PauseProducing(this); }
+      void StopProducing(ExecNode* output) override { 
inputs_[0]->StopProducing(this); }
+
+      // An ExecNode has a single output schema to which all its batches 
conform.
+      using ExecNode::output_schema;
+
+      // ExecNodes carry basic introspection for debugging purposes
+      const char* kind_name() const override { return "PassthruNode"; }
+      using ExecNode::label;

Review comment:
       It's just for documenting purposes, it happens to be valid C++

##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -0,0 +1,291 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+==========================
+Streaming execution engine
+==========================
+
+.. warning::
+
+    The streaming execution engine is experimental, and a stable API
+    is not yet guaranteed.
+
+Motivation
+----------
+
+For many complex computations, successive direct `invocation of
+compute functions <invoking compute functions>` is not feasible
+in either memory or computation time. Doing so causes all intermediate
+data to be fully materialized. To facilitate arbitrarily large inputs
+and more efficient resource usage, arrow also provides a streaming query
+engine with which computations can be formulated and executed.
+
+.. image:: simple_graph.svg
+
+:class:`ExecNode` is provided to reify the graph of operations in a query.
+Batches of data (struct:`ExecBatch`) flow along edges of the graph from
+node to node. Structuring the API around a stream of batches allows the
+working set for each node to be tuned for optimal performance independent
+of any other nodes in the graph. Each :class:`ExecNode` processes batches
+as they are pushed to it along an edge of the graph by upstream nodes
+(its inputs), and pushes batches along an edge of the graph to downstream
+nodes (its outputs) as they are finalized.
+
+.. [shaikhha et al] SHAIKHHA, A., DASHTI, M., & KOCH, C. (2018). Push versus 
pull-based loop fusion in query engines. Journal of Functional Programming, 28. 
https://doi.org/10.1017/s0956796818000102
+
+Overview
+--------
+
+:class:`ExecNode`
+  Each node in the graph is an implementation of the :class:`ExecNode` 
interface.
+
+:class:`ExecPlan`
+  A set of :class:`ExecNode` is contained and (to an extent) coordinated by an
+  :class:`ExecPlan`.
+
+:class:`ExecFactoryRegistry`
+  Instances of :class:`ExecNode` are constructed by factory functions held
+  in a :class:`ExecFactoryRegistry`.
+
+:class:`ExecNodeOptions`
+  Heterogenous parameters for factories of :class:`ExecNode` are bundled in an
+  :class:`ExecNodeOptions`.
+
+:struct:`Declaration`
+  ``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`.
+
+:struct:`ExecBatch`
+  A lightweight container for a single chunk of arrow-formatted data. In 
contrast
+  to :class:`RecordBatch`, :struct:`ExecBatch` is intended for use exclusively
+  in a streaming execution context (for example, it will never have a 
corresponding
+  Python binding). Furthermore columns which happen to have a constant value 
may
+  be represented by a :class:`Scalar` instead of an :class:`Array`. In 
addition,
+  :struct:`ExecBatch` may carry execution-relevant properties including a
+  guaranteed-true-filter for :class:`Expression` simplification.
+
+
+An example :class:`ExecNode` implementation which simlpy passes all input 
batches
+through unchanged::
+
+    class PassthruNode : public ExecNode {
+     public:
+      // InputReceived is the main entry point for ExecNodes. It is invoked
+      // by an input of this node to push a batch here for processing.
+      void InputReceived(ExecNode* input, ExecBatch batch) override {
+        // Since this is a passthru node we simply push the batch to our
+        // only output here.
+        outputs_[0]->InputReceived(this, batch);
+      }
+
+      // ErrorRecieved is called by an input of this node to report an error.
+      void ErrorReceived(ExecNode* input, Status error) override {
+        outputs_[0]->ErrorReceived(this, error);
+      }
+
+      // InputFinished is used to signal how many batches will ultimately 
arrive.
+      // It may be called with any ordering relative to 
InputReceived/ErrorReceived.
+      void InputFinished(ExecNode* input, int total_batches) override {
+        outputs_[0]->InputFinished(this, total_batches);
+      }
+
+      // ExecNodes may request that their inputs throttle production of batches
+      // until they are ready for more, or stop production if no further 
batches
+      // are required.
+      void ResumeProducing(ExecNode* output) override { 
inputs_[0]->ResumeProducing(this); }
+      void PauseProducing(ExecNode* output) override { 
inputs_[0]->PauseProducing(this); }
+      void StopProducing(ExecNode* output) override { 
inputs_[0]->StopProducing(this); }
+
+      // An ExecNode has a single output schema to which all its batches 
conform.
+      using ExecNode::output_schema;
+
+      // ExecNodes carry basic introspection for debugging purposes
+      const char* kind_name() const override { return "PassthruNode"; }
+      using ExecNode::label;
+      using ExecNode::SetLabel;
+      using ExecNode::ToString;
+
+      // An ExecNode holds references to its inputs and outputs, so it is 
possible
+      // to walk the graph of execution if necessary.
+      using ExecNode::inputs;
+      using ExecNode::outputs;
+
+      // StartProducing() and StopProducing() are invoked by an ExecPlan to
+      // coordinate the graph-wide execution state.
+      Status StartProducing() override { return Status::OK(); }
+      void StopProducing() override {}
+      Future<> finished() override { return inputs_[0]->finished(); }
+    };
+
+Note that each method which is associated with an edge of the graph must be 
invoked
+with an ``ExecNode*`` to identify the node which invoked it. For example, in an
+:class:`ExecNode` which implements ``JOIN`` this tagging might be used to 
differentiate
+between batches from the left or right inputs.
+``InputReceived, ErrorReceived, InputFinished`` may only be invoked by the 
inputs of a

Review comment:
       will do




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] bkietz commented on a change in pull request #11309: ARROW-13227: [Documentation][Compute] Document ExecNode

Reply via email to