[GitHub] [arrow-datafusion] yjshen commented on issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs`

GitBox Wed, 15 Jun 2022 19:07:25 -0700


yjshen commented on issue #2723:
URL: 
https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1157151367

I agree we need `List` support in Row since it's used by
`ApproxPercentileCont`, `ArrayAgg`, `Distinct*`, etc., as state fields.
`Struct` is not currently used as a state for existing accumulators, but we
will need it soon for other purposes.

List and Struct would not be too complicated to implement in the row format:
- For a struct, we could regard it as a variable-length-field(varlena), and
interpret/serialize the data as another row.
- List would need one more size field denoting its length.

_You could refer to the Spark repo to lend some ideas from its `UnsafeRow`
implementations._

On the other hand, I suggest we evaluate the `Distinct*` aggregates through
double aggregation for more predictable/reasonable memory usage, the current
"collecting values into list" logic would OOM fast with hot groupings.

This optimizer rule would rewrites an aggregate query with distinct
aggregations into an expanded double aggregation in which the regular
aggregation expressions and every distinct clause is aggregated in a separate
group. The results are then combined in a second aggregate. Check this [Spark
optimizer
rule](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] yjshen commented on issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs`

Reply via email to