alamb commented on issue #17446: URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3356220342
> Slicing row of a struct arrays can be expensive, i wonder how polar store the data of this column during their execution, or do they consider the column as a struct type at all What I would suggest we do initially in DataFusion is: 1. Retain references to all input ArrayRefs and the corresponding group_indexes on aggregation phase 2. On output phase, form a new underlying values array by copying the relevant input rows in order (using the [`interleave` ](https://docs.rs/arrow/latest/arrow/compute/kernels/interleave/index.html)kernel) That approach requires no slicing and will work for any arbitrary array type This design will require 2x the input size at peak (for the input and the inprogress output) but I think is probably significantly better than using the Accumulators (which also holds each input array entirely, but also has a bunch of individual allocations per row to store the ScalarValues Here is a ascii art idea: GroupsAccumulator State ``` ┌────────┐ ┌─────┐ │ A │ │ 0 │ ├────────┤ ├─────┤ │ B │ │ 1 │ ├────────┤ ├─────┤ │ C │ │ 0 │ ├────────┤ ├─────┤ │ D │ │ 2 │ ├────────┤ ├─────┤ State │ E │ │ 3 │ └────────┘ └─────┘ Accumulator keeps all input `ArrayRef` and the corresponding ┌────────┐ ┌─────┐ group_indexes for each │ V │ │ 0 │ row ├────────┤ ├─────┤ │ W │ │ 0 │ ├────────┤ ├─────┤ │ X │ │ 4 │ ├────────┤ ├─────┤ │ Y │ │ 2 │ ├────────┤ ├─────┤ │ Z │ │ 4 │ └────────┘ └─────┘ ArrayRefs group indices ``` GroupsAccumulator output (emit): ``` Output ListArray ┌────────┐ ┌─────┐ │ A │ │ 0 │─── ├────────┤ ├─────┤ │ ┌ ─ ─ ▶ [A, C, V, W] │ C │ │ 0 │ ├────────┤ ├─────┤ │─ ─ ─ ┘ │ V │ │ 0 │ ├────────┤ ├─────┤ │ │ W │ │ 0 │─── ┌ ─ ─ ─ ▶ [B] ├────────┤ ├─────┤ │ B │ │ 1 │ ─ ─ ─ ─ ┘ ├────────┤ ├─────┤ │ D │ │ 2 │─── ┌ ─ ─ ▶ [D, Y] ├────────┤ ├─────┤ │─ ─ ─ │ Y │ │ 2 │─── ├────────┤ ├─────┤ ─ ─ ─▶ [E, X] │ E │ │ 3 │─── │ ├────────┤ ├─────┤ │─ ─ ─ │ X │ │ 3 │─── ├────────┤ ├─────┤ ─ ─ ─ ─ ▶ [Z] │ Z │ │ 4 │─ ─ ─ ─ ┘ └────────┘ └─────┘ Output: use the Output Values group interleave kernel to array indexes reorder the input arrays so all values from the same group index are contiguous New ArrayRef group and form the indices corresponding ListArray output ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
