gabotechs opened a new pull request, #16816:
URL: https://github.com/apache/datafusion/pull/16816

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   - Closes #.
   
   ## Rationale for this change
   
   Follow up on:
   - https://github.com/apache/datafusion/pull/16346
   - https://github.com/apache/datafusion/pull/16519
   
   The `ArrayAggAccumulator`, unlike `DistinctArrayAggAccumulator` or 
`OrderSensitiveArrayAggAccumulator` or many other accumulators, accumulates the 
values by directly referencing to the source `ArrayRef`s rather than making a 
`ScalarValue` out of it. This is good for performance as we can afford to just 
keep references to the original buffers, but it has the drawback that it makes 
complicated measuring the memory consumed by the `ArrayAggAccumulator`.
   
   When calling `ArrayRef::get_array_memory_size()` we get the memory occupied 
by the whole underlaying buffer, but it's not technically true that 
`ArrayAggAccumulator` is occupying all that space. 
   
   I found this other `ArrayData::get_slice_memory_size()` method with the 
following docs:
   ```
   Returns the total number of the bytes of memory occupied by the buffers by 
this slice of ArrayData (See also diagram on ArrayData).
   This is approximately the number of bytes if a new ArrayData was formed by 
creating new Buffers with exactly the data needed.
   For example, a DataType::Int64 with 100 elements, 
Self::get_slice_memory_size would return 100 * 8 = 800. If the ArrayData was 
then Self::sliceed to refer to its first 20 elements, then 
Self::get_slice_memory_size on the sliced ArrayData would return 20 * 8 = 160.
   ``` 
   
   Which leads me to think that if we are accumulating `ArrayRefs`, rather than 
compacting them by copying their data, it might be better to not do it and just 
report the consumed memory by calling `get_slice_memory_size()`. I think it's 
still not technically true that `ArrayAggAccumulator` is occupying that space, 
as it's technically memory not owned by `ArrayAggAccumulator`, it's just 
referenced by it but owned by someone else, but it's the closest thing to 
reality that I was able to come up with.
   
   ## What changes are included in this PR?
   
   Stops copying ArrayRef data in `ArrayAggAccumulator` and starts measuring 
its occupied size with `get_slice_memory_size()`
   
   ## Are these changes tested?
   
   yes, by existing tests
   
   ## Are there any user-facing changes?
   
   People should stop seeing ResourceExhausted errors in aggregations that 
imply `ArrayAgg`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to