jordepic opened a new pull request, #22862:
URL: https://github.com/apache/datafusion/pull/22862

   ## Which issue does this PR close?
   
   - Closes #22861.
   
   ## Rationale for this change
   
   When using DataFusion Comet, I noticed that my hash join operator was failing 
with the following error: `Failed to acquire 142606336 bytes where 17142251456 
bytes already reserved and the fair limit is 17179869184 bytes, 4 registered`. 
Looking into this more, DataFusion reserves memory for each batch (8192 rows by 
default) on the build side of a hash join, so in total it tries to reserve 
(without actually allocating it) num_batches * batch_size. This is problematic 
when the batches are zero-copy slices of a larger batch (e.g. those coming from 
GroupedHashAggregateStream), since each slice's size is evaluated to be the size 
of the larger buffer. This is because a reference to the slice keeps the entire 
buffer from being freed. DataFusion doesn't over-allocate memory (the 
underlying data is shared), but it does over-request it in the centralized 
accounting system, which can lead to these "ResourcesExhausted" errors.
   
   ## What changes are included in this PR?
   
   In this change, we keep track of the buffers we've already counted in a set 
of pointers. This way, we don't redundantly request memory for the whole Arrow 
buffer once for each sub-slice of it. We chose this approach over simply 
requesting a smaller amount of memory per batch because, as mentioned above, 
each slice technically keeps the entire Arrow buffer from being freed.
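
The idea can be sketched with the standard library alone. This is an illustration of pointer-based deduplication, not the code in this PR: `BufferSlice` and `count_unique_buffer_bytes` are hypothetical names, an `Arc<Vec<u8>>` stands in for an Arrow buffer, and the real `count_record_batch_memory_size` helper may have a different signature and behavior.

```rust
use std::collections::HashSet;
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a zero-copy slice of a shared buffer.
struct BufferSlice {
    backing: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl BufferSlice {
    /// The bytes this slice logically covers.
    fn bytes(&self) -> &[u8] {
        &self.backing[self.range.clone()]
    }
}

/// Returns the number of bytes that still need to be reserved for `slices`.
/// Each backing allocation is counted in full, but only the first time its
/// address is seen; `seen` carries that state across calls.
fn count_unique_buffer_bytes(slices: &[BufferSlice], seen: &mut HashSet<usize>) -> usize {
    let mut total = 0;
    for slice in slices {
        // The address of the shared allocation identifies the buffer.
        let key = Arc::as_ptr(&slice.backing) as usize;
        if seen.insert(key) {
            total += slice.backing.len();
        }
    }
    total
}
```

Each buffer is still counted at its full size, which matches the fact that one live slice pins the whole allocation. One caveat of this sketch: an address is only a reliable identity while the buffer it points to is alive, so the set should not outlive the slices it tracks.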
   
   ## Are these changes tested?
   
   The new hash join test fails on main with ResourcesExhausted and passes with 
this change.
   
   ## Are there any user-facing changes?
   
   No breaking changes. Adds a new public helper, `count_record_batch_memory_size`, 
to `datafusion-common`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

