yjshen commented on issue #2723: URL: https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1157151367
I agree we need `List` support in Row since it's used by `ApproxPercentileCont`, `ArrayAgg`, `Distinct*`, etc., as state fields. `Struct` is not currently used as a state for existing accumulators, but we will need it soon for other purposes. List and Struct would not be too complicated to implement in the row format: - For a struct, we could regard it as a variable-length-field(varlena), and interpret/serialize the data as another row. - List would need one more size field denoting its length. _You could refer to the Spark repo to lend some ideas from its `UnsafeRow` implementations._ On the other hand, I suggest we evaluate the `Distinct*` aggregates through double aggregation for more predictable/reasonable memory usage, the current "collecting values into list" logic would OOM fast with hot groupings. This optimizer rule would rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group. The results are then combined in a second aggregate. Check this [Spark optimizer rule](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
