[GitHub] [arrow] Crystrix commented on a change in pull request #12368: ARROW-13993: [C++] [Compute] Add hash_one aggregate function

GitBox Tue, 08 Feb 2022 22:47:25 -0800


Crystrix commented on a change in pull request #12368:
URL: https://github.com/apache/arrow/pull/12368#discussion_r802311399




##########
File path: cpp/src/arrow/compute/kernels/hash_aggregate.cc
##########
@@ -2451,6 +2451,92 @@ Result<std::unique_ptr<KernelState>> GroupedDistinctInit(KernelContext* ctx,
   return std::move(impl);
 }
 
+// ----------------------------------------------------------------------
+// One implementation
+
+struct GroupedOneImpl : public GroupedAggregator {
+  Status Init(ExecContext* ctx, const std::vector<ValueDescr>&,
+              const FunctionOptions* options) override {
+    ctx_ = ctx;
+    pool_ = ctx->memory_pool();
+    return Status::OK();
+  }
+
+  Status Resize(int64_t new_num_groups) override {
+    num_groups_ = new_num_groups;
+    return Status::OK();
+  }
+
+  Status Consume(const ExecBatch& batch) override {
+    ARROW_ASSIGN_OR_RAISE(std::ignore, grouper_->Consume(batch));
+    return Status::OK();
+  }
+
+  Status Merge(GroupedAggregator&& raw_other,
+               const ArrayData& group_id_mapping) override {
+    auto other = checked_cast<GroupedOneImpl*>(&raw_other);
+
+    // Get (value, group_id) pairs, then translate the group IDs and consume them
+    // ourselves
+    ARROW_ASSIGN_OR_RAISE(auto uniques, other->grouper_->GetUniques());
+    ARROW_ASSIGN_OR_RAISE(auto remapped_g,
+                          AllocateBuffer(uniques.length * sizeof(uint32_t), pool_));
+
+    const auto* g_mapping = group_id_mapping.GetValues<uint32_t>(1);
+    const auto* other_g = uniques[1].array()->GetValues<uint32_t>(1);
+    auto* g = reinterpret_cast<uint32_t*>(remapped_g->mutable_data());
+
+    for (int64_t i = 0; i < uniques.length; i++) {
+      g[i] = g_mapping[other_g[i]];
+    }
+    uniques.values[1] =
+        ArrayData::Make(uint32(), uniques.length, {nullptr, std::move(remapped_g)});
+
+    return Consume(std::move(uniques));
+  }
+
+  Result<Datum> Finalize() override {

Review comment:
   I think the extra `grouper_` variable carried over from `GroupedDistinctImpl`
   is not necessary, since `hash_one` doesn't need to calculate distinct values.
   The structure of `hash_one` can instead follow `GroupedMinMaxImpl`. In a way,
   `min/max` is a special case of `hash_one`.
   
   Like the `mins_` variable in `GroupedMinMaxImpl`, which stores the min value
   of each group, we can have a similar variable to store the values. The
   remaining operations should then be similar to `GroupedMinMaxImpl`, just
   without the value comparison.
   
   - `Consume` function: store the value for each group if the group doesn't
     have a value yet.
   - `Merge` function: add the values of the new groups.
   - `Finalize` function: output the values and groups, which is the same as
     `GroupedMinMaxImpl`.
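
   To make the three steps above concrete, here is a rough, standalone sketch
   of the idea. It deliberately does not use Arrow's types: `GroupedOneSketch`,
   `ones_` and `has_one_` are hypothetical names, values are plain `int64_t`,
   and `std::nullopt` stands in for a null. Skipping nulls when picking a value
   is an assumption of this sketch, not something settled in the PR.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <optional>
   #include <vector>

   // Hypothetical stand-in for a grouped "one" aggregator (not Arrow's API).
   struct GroupedOneSketch {
     std::vector<int64_t> ones_;   // kept value per group, like mins_ in min/max
     std::vector<bool> has_one_;   // whether the group already holds a value

     void Resize(std::size_t new_num_groups) {
       ones_.resize(new_num_groups, 0);
       has_one_.resize(new_num_groups, false);
     }

     // values[i] belongs to group group_ids[i]. Keep the first non-null value
     // seen for each group; no comparison between values is needed.
     void Consume(const std::vector<std::optional<int64_t>>& values,
                  const std::vector<uint32_t>& group_ids) {
       assert(values.size() == group_ids.size());
       for (std::size_t i = 0; i < values.size(); ++i) {
         const uint32_t g = group_ids[i];
         assert(g < ones_.size());
         if (values[i].has_value() && !has_one_[g]) {
           ones_[g] = *values[i];
           has_one_[g] = true;
         }
       }
     }

     // group_id_mapping[other_g] is the id of the same group in *this.
     // Only groups that do not hold a value yet take one from `other`.
     void Merge(const GroupedOneSketch& other,
                const std::vector<uint32_t>& group_id_mapping) {
       assert(group_id_mapping.size() == other.ones_.size());
       for (std::size_t other_g = 0; other_g < other.ones_.size(); ++other_g) {
         const uint32_t g = group_id_mapping[other_g];
         assert(g < ones_.size());
         if (other.has_one_[other_g] && !has_one_[g]) {
           ones_[g] = other.ones_[other_g];
           has_one_[g] = true;
         }
       }
     }

     // One output slot per group; groups that never saw a value stay null.
     std::vector<std::optional<int64_t>> Finalize() const {
       std::vector<std::optional<int64_t>> out(ones_.size());
       for (std::size_t g = 0; g < ones_.size(); ++g) {
         if (has_one_[g]) out[g] = ones_[g];
       }
       return out;
     }
   };
   ```

   Replacing the `!has_one_[g]` check with a comparison against the stored
   value would turn this into a min or max aggregator, which is the sense in
   which `min/max` is a special case of `hash_one`.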
   




