[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

via GitHub Sun, 02 Apr 2023 08:18:11 -0700


rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493370198


   @westonpace, I've looked into this for a while and things are confusing.
   
   First, I've found that Acero aggregation produces an output schema different 
than the one you expect here, and in particular one that collects measurements 
then keys then segment keys. Here are my observations:
   - [The `ScalarAggregateNode` 
class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L255),
 for which the keys are empty, gets [an output 
scheme](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L377)
 that collects 
[measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L369)
 then [segment 
keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L372-L373).
   - [The `GroupByNode` 
class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L533)
 gets [an output 
schema](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L663)
 that collects 
[measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L648-L649)
 then 
[keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L654)
 then [segment 
keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L659).
   - A [comment in 
`GroupByNode`](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L646)
 says: "Aggregate fields come before key fields to match the behavior of 
GroupBy function".
   
   Second, the [output-schema-building Arrow Substrait 
code](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/engine/substrait/relation_internal.cc#L334-L357)
 you noted is trying to collect measurement then keys then segment keys. 
However, as you noted, it looks like it collects the measurements from the 
input schema instead of the output schema. I think the code should first 
construct the output schema (which involves accessing the `name` field of each 
`Aggregate`) and then collect from it.
   
   Third, the Substrait spec may have a different idea about the expected field 
order of an aggregation. Assuming so, we should probably fix Acero to use this 
order, especially since the `GroupBy` function mentioned in the comment [no 
longer exists](https://github.com/apache/arrow/issues/14866).
   
   Fourth, I suspect that a deeper reason for why we get away with such bugs is 
that Arrow Substrait generally doesn't care about field names, since Substrait 
generally doesn't (until the final output), so as long as the field type 
happens to be right then the test-case passes. Perhaps, besides testing 
after-projection, we should make the existing test-cases produce different 
types per field as much as possible.
   
   cc @icexelloss


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Reply via email to