[GitHub] [arrow] westonpace commented on a diff in pull request #14352: ARROW-17642: [C++] Add ordered aggregation

GitBox Fri, 23 Dec 2022 08:16:34 -0800


westonpace commented on code in PR #14352:
URL: https://github.com/apache/arrow/pull/14352#discussion_r1056079656



##########
cpp/src/arrow/compute/exec.h:
##########
@@ -180,7 +180,7 @@ struct ARROW_EXPORT ExecBatch {
 
   explicit ExecBatch(const RecordBatch& batch);
 
-  static Result<ExecBatch> Make(std::vector<Datum> values);
+  static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = 
-1);

Review Comment:
   Naively one looking at this might be confused why you now need both 
`ExecBatch::Make` and `ExecBatch::ExecBatch` since both take a vector of values 
and a length.
   
   Looking closer it seems the `Make` function does the extra work of verifying 
that the length of the datums match the given length.
   
   Could you add some comments explaining this for future readers?



##########
cpp/src/arrow/compute/exec.h:
##########
@@ -233,6 +233,17 @@ struct ARROW_EXPORT ExecBatch {
 
   ExecBatch Slice(int64_t offset, int64_t length) const;
 
+  Result<ExecBatch> SelectValues(const std::vector<int>& ids) const {

Review Comment:
   I know there aren't many comment blocks in this file but `ExecBatch` is less 
experimental than it used to be.  Can you comment this method?
   
   Also, is there any particular reason to have the implementation in the 
header file?



##########
cpp/src/arrow/compute/row/grouper.cc:
##########
@@ -44,7 +45,405 @@ namespace compute {
 
 namespace {
 
-struct GrouperImpl : Grouper {
+inline const uint8_t* GetValuesAsBytes(const ArrayData& data, int64_t offset = 
0) {
+  int64_t absolute_byte_offset = (data.offset + offset) * 
data.type->byte_width();

Review Comment:
   Maybe a `DCHECK` here to ensure that `data.type->byte_width() > 0` (e.g. to 
ensure that this is a fixed-width data type)



##########
cpp/src/arrow/compute/exec.h:
##########
@@ -233,6 +233,17 @@ struct ARROW_EXPORT ExecBatch {
 
   ExecBatch Slice(int64_t offset, int64_t length) const;
 
+  Result<ExecBatch> SelectValues(const std::vector<int>& ids) const {
+    std::vector<Datum> selected_values(ids.size());
+    for (size_t i = 0; i < ids.size(); i++) {
+      if (ids[i] < 0 || static_cast<size_t>(ids[i]) >= values.size()) {
+        return Status::Invalid("ExecBatch invalid value selection: ", ids[i]);
+      }
+      selected_values[i] = values[ids[i]];
+    }

Review Comment:
   ```suggestion
       std::vector<Datum> selected_values;
       selected_values.reserve(ids.size());
       for (int idx : ids) {
         if (idx < 0 || idx >= static_cast<int>(values.size())) {
           return Status::Invalid("ExecBatch invalid value selection: ", idx);
         }
         selected_values.push_back(values[idx]);
       }
   ```
   Minor nit: this seems slightly more readable with a foreach loop.



##########
cpp/src/arrow/compute/exec/aggregate_node.cc:
##########
@@ -352,15 +493,34 @@ class GroupByNode : public ExecNode {
       int key_field_id = key_field_ids[i];
       output_fields[base + i] = input_schema->field(key_field_id);
     }
+    base += keys.size();
+    for (size_t i = 0; i < segment_keys.size(); ++i) {
+      int segment_key_field_id = segment_key_field_ids[i];
+      output_fields[base + i] = input_schema->field(segment_key_field_id);

Review Comment:
   I don't think `output_fields` is large enough to do `output_fields[base + 
i]`.  I get a segmentation fault here.
   
   Are the segment keys part of the output?  I would expect them to be. 
However, I also don't see any place in GroupByNode::Finalize that is putting 
the segment keys into the output data.



##########
cpp/src/arrow/compute/row/grouper.h:
##########
@@ -39,10 +77,19 @@ class ARROW_EXPORT Grouper {
   static Result<std::unique_ptr<Grouper>> Make(const std::vector<TypeHolder>& 
key_types,
                                                ExecContext* ctx = 
default_exec_context());
 
-  /// Consume a batch of keys, producing the corresponding group ids as an 
integer array.
+  /// Consume a batch of keys, producing the corresponding group ids as an 
integer array,
+  /// over a slice defined by an offset and length, which defaults to the 
batch length.
+  /// Currently only uint32 indices will be produced, eventually the bit width 
will only
+  /// be as wide as necessary.

Review Comment:
   ```suggestion
   ```
   We can save this for a future PR or put it into a TODO but speculation of 
future issues should be left out of `///` comment blocks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #14352: ARROW-17642: [C++] Add ordered aggregation

Reply via email to