[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

via GitHub Tue, 21 Feb 2023 14:54:55 -0800


westonpace commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113653542



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and 
Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output 
order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order 
but that
+  /// odering is not represented by any column in the data.  The most common 
case for this
+  /// is when reading data from an in-memory table.  The data has an implicit 
"row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be 
done
+  /// other than ensure the index assigned to output batches is the same as the
+  /// input batch that was mapped.
+  ///
+  /// Other nodes may introduce order.  For example, an order-by node will emit
+  /// a brand new ordering independent of the input ordering.
+  ///
+  /// Finally, as described above, such as a hash-join or aggregation may may
+  /// destroy ordering (although these nodes could also choose to establish a
+  /// new ordering based on the hash keys).
+  ///
+  /// Some nodes will require an ordering.  For example, a fetch node or an
+  /// asof join node will only function if the input data is ordered (for fetch
+  /// it is enough to be implicitly ordered.  For an asof join the ordering 
must
+  /// be explicit and compatible with the on key.)
+  ///
+  /// Nodes that maintain ordering should be careful to avoid introducing gaps
+  /// in the batch index.  This may require emitting empty batches in order to
+  /// maintain continuity.

Review Comment:
   Good question.  The dataset writer will discard empty batches without 
writing anything.  However, the sink node still respects empty batches.  For 
example, if one were doing `dataset.to_batches(...)` then they might see an 
empty batch.
   
   I'm fairly certain this is consistent with the current implementation and 
not a change in behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #34137: GH-34136: [C++] Add a concept of ordering to ExecPlan

Reply via email to