jorisvandenbossche commented on code in PR #34137:
URL: https://github.com/apache/arrow/pull/34137#discussion_r1113577301


##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)

Review Comment:
   If there is an order based on a column "x", does that also guarantee 
something about the order _within_ each batch? (or only between batches as this 
paragraph explains)



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and 
Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output 
order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order 
but that
+  /// odering is not represented by any column in the data.  The most common 
case for this
+  /// is when reading data from an in-memory table.  The data has an implicit 
"row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be 
done

Review Comment:
   What is exactly meant with "map node"? (I don't find this term used anywhere 
else in the compute / Acero code or docs) Do we mean a node that uses typical 
element-wise scalar kernels? Also something like a filter node will always 
preserve ordering.



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and 
Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output 
order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order 
but that
+  /// odering is not represented by any column in the data.  The most common 
case for this
+  /// is when reading data from an in-memory table.  The data has an implicit 
"row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be 
done
+  /// other than ensure the index assigned to output batches is the same as the
+  /// input batch that was mapped.
+  ///
+  /// Other nodes may introduce order.  For example, an order-by node will emit
+  /// a brand new ordering independent of the input ordering.
+  ///
+  /// Finally, as described above, such as a hash-join or aggregation may may
+  /// destroy ordering (although these nodes could also choose to establish a
+  /// new ordering based on the hash keys).
+  ///
+  /// Some nodes will require an ordering.  For example, a fetch node or an
+  /// asof join node will only function if the input data is ordered (for fetch
+  /// it is enough to be implicitly ordered.  For an asof join the ordering 
must
+  /// be explicit and compatible with the on key.)
+  ///
+  /// Nodes that maintain ordering should be careful to avoid introducing gaps
+  /// in the batch index.  This may require emitting empty batches in order to
+  /// maintain continuity.

Review Comment:
   Just wondering, but assume you would do a filter operation that filters a 
large part of the data, and so you might end up with several empty batches, 
does that affect for example sink nodes like writing files? (do we skip empty 
batches there, or do we then potentially write empty files for them?) 



##########
cpp/src/arrow/compute/exec/exec_plan.h:
##########
@@ -146,6 +148,46 @@ class ARROW_EXPORT ExecNode {
 
   virtual Status Validate() const;
 
+  /// \brief the ordering of the output batches
+  ///
+  /// This does not guarantee the batches will be emitted by this node
+  /// in order.  Instead it guarantees that the batches will have their
+  /// ExecBatch::index property set in a way that respects this ordering.
+  ///
+  /// In other words, given the ordering {{"x", SortOrder::Ascending}} we
+  /// know that all values of x in a batch with index N will be less than
+  /// or equal to all values of x in a batch with index N+k (assuming k > 0)
+  ///
+  /// Note that an ordering can be both Ordering::Unordered and 
Ordering::Implicit.
+  /// A node's output should be marked Ordering::Unordered if the order is
+  /// non-deterministic.  For example, a hash-join has no predictable output 
order.
+  ///
+  /// If the ordering is Ordering::Implicit then there is a meaningful order 
but that
+  /// odering is not represented by any column in the data.  The most common 
case for this
+  /// is when reading data from an in-memory table.  The data has an implicit 
"row order"
+  /// which is not neccesarily represented in the data set.
+  ///
+  /// A typical map node will not modify the ordering.  Nothing needs to be 
done

Review Comment:
   OK, I searched for "map node" and not for "MapNode" ;) 
   And I see that indeed both Project and Filter node inherit from MapNode 
(it's still a term that is not used in our documentation though)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to