alamb commented on PR #6800:
URL:
https://github.com/apache/arrow-datafusion/pull/6800#issuecomment-1622290981
> I agree that would be fast, but this comes at the cost of storing every
> seen value? How would we restrict memory usage this way?

Sorry, what I meant was something like the following, where the accumulator
stores only the current minimum values.

This approach could leave `min_storage` full of "garbage" if many batches
contain new minimums, but I think we could heuristically "compact"
`min_storage` if it got too large (once it held `2*num_groups` rows, for
example).
```
                                  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                                  │            Accumulator            │
                                                  state
┌─────────┐     ┌─────────┐       │ ┌─────────┐          ┌─────────┐  │
│ ┌─────┐ │     │ ┌─────┐ │         │ ┌─────┐ │          │ ┌─────┐ │
│ │  A  │ │     │ │  A  │─┼────┐  │ │ │  D  │ │    ┌─────┼─│  1  │ │  │
│ ├─────┤ │     │ ├─────┤ │    │    │ ├─────┤ │    │     │ ├─────┤ │
│ │  B  │ │     │ │  B  │ │    └──┼─┼▶│  A  │◀┼────┘     │ │  0  │ │  │
│ ├─────┤ │     │ ├─────┤ │         │ └─────┘ │          │ └─────┘ │
│ │  A  │ │     │ │  A  │ │       │ │         │          │         │  │
│ ├─────┤ │     │ ├─────┤ │         │         │          │         │
│ │  A  │ │     │ │  A  │ │       │ │         │          │         │  │
│ ├─────┤ │     │ ├─────┤ │         │         │          │         │
│ │  C  │ │     │ │  C  │ │       │ │         │          │         │  │
│ └─────┘ │     │ └─────┘ │         │         │          │         │
└─────────┘     └─────────┘       │ └─────────┘          └─────────┘  │
   input           input          │ min_storage:         min_values   │
   values          values               Rows
  (Array)          (Rows)         └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
  step 1:             step 2: for
  convert             any value                   step 3: min value
  arguments to        that is a new               (per group) is
  Row format          group                       tracked as an
                      minimum, copy               index into
                      into min_storage            `Rows`
```
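
To make the idea concrete, here is a rough, std-only sketch of the bookkeeping. It is not the code in this PR: `MinAccumulator`, `update_batch`, `compact`, and `min` are hypothetical names, each row is modeled as a plain `Vec<u8>` standing in for an entry of arrow's `Rows` (which compares bytewise), and the `2*num_groups` threshold is just the heuristic mentioned above.

```rust
/// Hypothetical sketch: each row is a byte-comparable `Vec<u8>`,
/// standing in for one entry of arrow's row-format `Rows`.
struct MinAccumulator {
    /// Append-only storage of rows that were a group minimum when first seen.
    /// Older entries become "garbage" once a smaller row replaces them.
    min_storage: Vec<Vec<u8>>,
    /// Per group: index into `min_storage` of the current minimum, if any.
    min_values: Vec<Option<usize>>,
}

impl MinAccumulator {
    fn new() -> Self {
        Self { min_storage: Vec::new(), min_values: Vec::new() }
    }

    /// Steps 2 and 3 of the diagram: copy any row that is a new group minimum
    /// into `min_storage` and point the group's slot at it.
    fn update_batch(&mut self, rows: &[Vec<u8>], group_indices: &[usize], num_groups: usize) {
        if self.min_values.len() < num_groups {
            self.min_values.resize(num_groups, None);
        }
        for (row, &group) in rows.iter().zip(group_indices) {
            let is_new_min = match self.min_values[group] {
                Some(idx) => row < &self.min_storage[idx],
                None => true,
            };
            if is_new_min {
                self.min_storage.push(row.clone());
                self.min_values[group] = Some(self.min_storage.len() - 1);
            }
        }
        // Assumed heuristic: compact once storage exceeds twice the group count.
        if self.min_storage.len() > 2 * num_groups {
            self.compact();
        }
    }

    /// Keep only the rows still referenced by some group and rewrite the indices.
    fn compact(&mut self) {
        let mut compacted = Vec::with_capacity(self.min_values.len());
        for slot in self.min_values.iter_mut() {
            if let Some(idx) = *slot {
                // Each stored index belongs to exactly one group, so moving the row out is safe.
                compacted.push(std::mem::take(&mut self.min_storage[idx]));
                *slot = Some(compacted.len() - 1);
            }
        }
        self.min_storage = compacted;
    }

    /// Current minimum row for `group`, if the group has seen any value.
    fn min(&self, group: usize) -> Option<&[u8]> {
        self.min_values
            .get(group)
            .copied()
            .flatten()
            .map(|idx| self.min_storage[idx].as_slice())
    }
}
```

With this layout, memory is bounded by roughly `2*num_groups` rows plus whatever a single batch adds before the next compaction, rather than by the number of distinct values seen.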