gabotechs opened a new pull request, #16816: URL: https://github.com/apache/datafusion/pull/16816
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #. ## Rationale for this change Follow up on: - https://github.com/apache/datafusion/pull/16346 - https://github.com/apache/datafusion/pull/16519 The `ArrayAggAccumulator`, unlike `DistinctArrayAggAccumulator` or `OrderSensitiveArrayAggAccumulator` or many other accumulators, accumulates the values by directly referencing to the source `ArrayRef`s rather than making a `ScalarValue` out of it. This is good for performance as we can afford to just keep references to the original buffers, but it has the drawback that it makes complicated measuring the memory consumed by the `ArrayAggAccumulator`. When calling `ArrayRef::get_array_memory_size()` we get the memory occupied by the whole underlaying buffer, but it's not technically true that `ArrayAggAccumulator` is occupying all that space. I found this other `ArrayData::get_slice_memory_size()` method with the following docs: ``` Returns the total number of the bytes of memory occupied by the buffers by this slice of ArrayData (See also diagram on ArrayData). This is approximately the number of bytes if a new ArrayData was formed by creating new Buffers with exactly the data needed. For example, a DataType::Int64 with 100 elements, Self::get_slice_memory_size would return 100 * 8 = 800. If the ArrayData was then Self::sliceed to refer to its first 20 elements, then Self::get_slice_memory_size on the sliced ArrayData would return 20 * 8 = 160. ``` Which leads me to think that if we are accumulating `ArrayRefs`, rather than compacting them by copying their data, it might be better to not do it and just report the consumed memory by calling `get_slice_memory_size()`. I think it's still not technically true that `ArrayAggAccumulator` is occupying that space, as it's technically memory not owned by `ArrayAggAccumulator`, it's just referenced by it but owned by someone else, but it's the closest thing to reality that I was able to come up with. ## What changes are included in this PR? Stops copying ArrayRef data in `ArrayAggAccumulator` and starts measuring its occupied size with `get_slice_memory_size()` ## Are these changes tested? yes, by existing tests ## Are there any user-facing changes? People should stop seeing ResourceExhausted errors in aggregations that imply `ArrayAgg` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org