bkietz commented on a change in pull request #11309: URL: https://github.com/apache/arrow/pull/11309#discussion_r727439323
##########
File path: docs/source/cpp/streaming_execution.rst
##########
@@ -0,0 +1,291 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::compute
+
+==========================
+Streaming execution engine
+==========================
+
+.. warning::
+
+    The streaming execution engine is experimental, and a stable API
+    is not yet guaranteed.
+
+Motivation
+----------
+
+For many complex computations, successive direct `invocation of
+compute functions <invoking compute functions>` is not feasible
+in either memory or computation time. Doing so causes all intermediate
+data to be fully materialized. To facilitate arbitrarily large inputs
+and more efficient resource usage, arrow also provides a streaming query
+engine with which computations can be formulated and executed.
+
+.. image:: simple_graph.svg
+
+:class:`ExecNode` is provided to reify the graph of operations in a query.
+Batches of data (struct:`ExecBatch`) flow along edges of the graph from
+node to node. Structuring the API around a stream of batches allows the
+working set for each node to be tuned for optimal performance independent
+of any other nodes in the graph. Each :class:`ExecNode` processes batches
+as they are pushed to it along an edge of the graph by upstream nodes
+(its inputs), and pushes batches along an edge of the graph to downstream
+nodes (its outputs) as they are finalized.
+
+.. [shaikhha et al] SHAIKHHA, A., DASHTI, M., & KOCH, C. (2018). Push versus pull-based loop fusion in query engines. Journal of Functional Programming, 28. https://doi.org/10.1017/s0956796818000102
+
+Overview
+--------
+
+:class:`ExecNode`
+  Each node in the graph is an implementation of the :class:`ExecNode` interface.
+
+:class:`ExecPlan`
+  A set of :class:`ExecNode` is contained and (to an extent) coordinated by an
+  :class:`ExecPlan`.
+
+:class:`ExecFactoryRegistry`
+  Instances of :class:`ExecNode` are constructed by factory functions held
+  in a :class:`ExecFactoryRegistry`.
+
+:class:`ExecNodeOptions`
+  Heterogenous parameters for factories of :class:`ExecNode` are bundled in an
+  :class:`ExecNodeOptions`.
+
+:struct:`Declaration`
+  ``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`.
+
+:struct:`ExecBatch`
+  A lightweight container for a single chunk of arrow-formatted data. In contrast
+  to :class:`RecordBatch`, :struct:`ExecBatch` is intended for use exclusively
+  in a streaming execution context (for example, it will never have a corresponding

Review comment:
very well