jordepic opened a new pull request, #22862: URL: https://github.com/apache/datafusion/pull/22862
## Which issue does this PR close? - Closes #22861. ## Rationale for this change When using DataFusion comet I noticed that my hash join operator was failing with the following error: `Failed to acquire 142606336 bytes where 17142251456 bytes already reserved and the fair limit is 17179869184 bytes, 4 registered`. Looking into this more, DataFusion asks to reserve memory for each batch (by default 8192 rows) of the build side of a hash join - and tries to reserve (without actually allocating it) num_batches * batch_size. This is problematic when these are batches are zero-copy slices of a larger batch (e.g. GroupedHashAggregateStream), since the slice size is evaluated to be the size of the larger buffer. This is because the reference to the slice actually keeps the entire buffer from being freed. DataFusion doesn't overallocate memory (the underlying data is the same), but it does over-request it (in the centralized accounting system), which can lead to these "ResourcesExhausted" exceptions. ## What changes are included in this PR? In this change, we keep track of all of the buffers that we've already counted via a set of pointers. This way, we don't redundantly request memory for the whole arrow buffer for each sub-slice of it. We choose this approach as opposed to just requesting a smaller amount of memory per batch, because as mentioned before, the pointer to each batch technically keeps the entire arrow-buffer from being freed. ## Are these changes tested? The new hash join test fails on main with ResourcesExhausted and passes with this change. ## Are there any user-facing changes? No breaking changes. Adds a new public helper count_record_batch_memory_size to datafusion-common. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
