yjshen commented on issue #2723:
URL: 
https://github.com/apache/arrow-datafusion/issues/2723#issuecomment-1157151367

   I agree we need `List` support in Row since it's used by 
`ApproxPercentileCont`, `ArrayAgg`, `Distinct*`, etc., as state fields. 
`Struct` is not currently used as a state for existing accumulators, but we 
will need it soon for other purposes.
   
   List and Struct would not be too complicated to implement in the row format:
   - For a struct, we could regard it as a variable-length-field(varlena), and 
interpret/serialize the data as another row.
   - List would need one more size field denoting its length.
   
   _You could refer to the Spark repo to lend some ideas from its `UnsafeRow` 
implementations._
   
   On the other hand, I suggest we evaluate the `Distinct*` aggregates through 
double aggregation for more predictable/reasonable memory usage, the current 
"collecting values into list" logic would OOM fast with hot groupings. 
   
   This optimizer rule would rewrites an aggregate query with distinct 
aggregations into an expanded double aggregation in which the regular 
aggregation expressions and every distinct clause is aggregated in a separate 
group. The results are then combined in a second aggregate. Check this [Spark 
optimizer 
rule](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala)
 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to