bkietz opened a new pull request #9621:
URL: https://github.com/apache/arrow/pull/9621


   In order to keep this patch simpler, the execution framework for scalar 
aggregate kernels is reused for grouped aggregations. This is not intended as a 
permanent arrangement.
   
   A `compute::Function` named `group_by` is added which implements grouped
   aggregation. `GroupByOptions::aggregates` is a vector specifying which
   aggregations will be performed: each element is a
   `GroupByOptions::Aggregate` containing the name of an aggregate
   function and a pointer to a `FunctionOptions`. The first arguments to
   `group_by` are interpreted as the corresponding aggregands and the remainder
   will be used as grouping keys. The output will be a struct array with one
   field per argument (aggregation results first, then keys), where each slot
   contains the aggregation results and keys for one group:
   
   ```c++
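   // Options for the "min_max" aggregation; the type is assumed here,
   // any FunctionOptions accepted by "min_max" would do.
   arrow::compute::MinMaxOptions min_max_options;

   GroupByOptions options{
       {"sum", nullptr},  // first argument will be summed
       {"min_max",
        &min_max_options},  // second argument's extrema will be found
   };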
   
   std::shared_ptr<arrow::Array> needs_sum = ...;
   std::shared_ptr<arrow::Array> needs_min_max = ...;
   std::shared_ptr<arrow::Array> key_0 = ...;
   std::shared_ptr<arrow::Array> key_1 = ...;
   
   ARROW_ASSIGN_OR_RAISE(arrow::Datum out,
                         arrow::compute::CallFunction("group_by",
                                                      {
                                                          needs_sum,
                                                          needs_min_max,
                                                          key_0,
                                                          key_1,
                                                      },
                                                      &options));
   
   // Unpack the struct array result (four fields: two aggregates, two keys)
   auto out_array = out.array_as<arrow::StructArray>();
   std::shared_ptr<arrow::Array> sums = out_array->field(0);
   std::shared_ptr<arrow::Array> mins_and_maxes = out_array->field(1);
   std::shared_ptr<arrow::Array> group_key_0 = out_array->field(2);
   std::shared_ptr<arrow::Array> group_key_1 = out_array->field(3);
   ```
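
To make the intended semantics concrete, here is a minimal standalone sketch of what the call above is expected to compute, written against plain `std::vector` columns rather than Arrow arrays. It is not the implementation in this patch: `GroupResult`, `GroupKey`, and `GroupBySketch` are hypothetical names used only for illustration, and the sketch assumes `int64_t` columns with no nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names exist in Arrow.
struct GroupResult {
  int64_t sum;  // "sum" of the first aggregand
  int64_t min;  // "min_max" of the second aggregand
  int64_t max;
};

// A group is identified by the pair of values taken from the two key columns.
using GroupKey = std::pair<int64_t, int64_t>;

std::map<GroupKey, GroupResult> GroupBySketch(
    const std::vector<int64_t>& needs_sum,
    const std::vector<int64_t>& needs_min_max,
    const std::vector<int64_t>& key_0,
    const std::vector<int64_t>& key_1) {
  // All columns must have one value per row.
  assert(needs_sum.size() == key_0.size());
  assert(needs_min_max.size() == key_0.size());
  assert(key_1.size() == key_0.size());

  std::map<GroupKey, GroupResult> groups;
  for (std::size_t row = 0; row < key_0.size(); ++row) {
    const GroupKey key(key_0[row], key_1[row]);
    auto it = groups.find(key);
    if (it == groups.end()) {
      // First row of this group: seed every aggregate from the row itself.
      groups.emplace(key, GroupResult{needs_sum[row], needs_min_max[row],
                                      needs_min_max[row]});
    } else {
      // Later rows: fold the row into the running aggregates.
      it->second.sum += needs_sum[row];
      it->second.min = std::min(it->second.min, needs_min_max[row]);
      it->second.max = std::max(it->second.max, needs_min_max[row]);
    }
  }
  return groups;
}
```

Each map entry corresponds to one slot of the struct array returned by `group_by`: the mapped value holds the aggregation results and the map key holds the group's key values.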
   
   TODO:
   - [ ] Implement more aggregators (only sum, count, and min_max are implemented so far)
   - [ ] Add an aggregator which returns a list of row indices of members for 
use in partitioned dataset writing
   - [ ] Reorganization
   - [ ] Comments
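
For the row-index aggregator mentioned in the TODO list, a rough sketch of the expected result shape, again on plain vectors and not on Arrow arrays: for each distinct key, collect the indices of the rows that belong to that group. `RowIndicesByKey` is a hypothetical name, and the two `int64_t` key columns are an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical illustration only; not part of this patch or of Arrow.
// Maps each distinct (key_0, key_1) pair to the row indices in that group,
// in ascending row order.
std::map<std::pair<int64_t, int64_t>, std::vector<std::size_t>> RowIndicesByKey(
    const std::vector<int64_t>& key_0, const std::vector<int64_t>& key_1) {
  // Both key columns must have one value per row.
  assert(key_0.size() == key_1.size());

  std::map<std::pair<int64_t, int64_t>, std::vector<std::size_t>> indices;
  for (std::size_t row = 0; row < key_0.size(); ++row) {
    // operator[] creates an empty index list the first time a key is seen.
    indices[std::make_pair(key_0[row], key_1[row])].push_back(row);
  }
  return indices;
}
```

A partitioned writer could then take the rows listed for each key and write them to that key's partition.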



