bharath-techie commented on issue #19216: URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3635622889
> Therefore, I don't think the issue is related to TopK itself, but rather the memory usage of one of the GroupByHash aggregates

I get this point: for each batch that gets inserted and referenced in TopK, the entire batch size gets added during size estimation (as mentioned in https://github.com/apache/datafusion/issues/9562). Here is the relevant code with my debug instrumentation added:

```rust
/// Insert a record batch entry into this store, tracking its
/// memory use, if it has any uses
pub fn insert(&mut self, entry: RecordBatchEntry) {
    // uses of 0 means that none of the rows in the batch were stored in the topk
    if entry.uses > 0 {
        let size = get_record_batch_memory_size(&entry.batch);
        self.batches_size += size;
        println!("size during insert : {}", size);
        self.batches.insert(entry.id, entry);
    }
}

/// returns the size of memory used by this store, including all
/// referenced `RecordBatch`es, in bytes
pub fn size(&self) -> usize {
    let size_of_self = size_of::<Self>();
    let capacity = self.batches.capacity();
    let per_entry = size_of::<u32>() + size_of::<RecordBatchEntry>();
    let batches_size = self.batches_size;
    let size = size_of_self + capacity * per_entry + batches_size;
    println!(
        "self size : {} , capacity : {} , heap size : {}, batch size : {}",
        size_of_self, capacity, per_entry, batches_size
    );
    println!("size during get : {}", size);
    size
}
```

For the above query, we have a ~4 GB record batch in `GroupByHashAggregate` that gets counted again for each record batch added to TopK:

```
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 4196909056
size during get : 4196909284
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 8393818112
size during get : 8393818340
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 12590727168
size during get : 12590727396
```

Going through previous related issues:

- I think @Dandandan [mentioned](https://github.com/apache/datafusion/issues/9417#issuecomment-2431943283) forcing compaction when reaching the memory limit; should we try that?
- https://github.com/apache/datafusion/pull/15591 could help as well. Or do we have any newer issues that could help here?

Apologies if I'm polluting this issue with something unrelated to TopK; maybe we can discuss this over in https://github.com/apache/datafusion/issues/9417 or in a new issue.
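To make the overcounting concrete, here is a minimal standalone sketch (not DataFusion code; `Buffer` and `Entry` are stand-ins for an Arrow buffer and a `RecordBatchEntry`). Several TopK entries that reference the same underlying batch get charged its full size once per entry, whereas deduplicating by the buffer's allocation address charges it once:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a shared Arrow buffer: a refcounted byte region.
type Buffer = Arc<Vec<u8>>;

/// Stand-in for a `RecordBatchEntry`: a view over a (possibly shared) buffer.
struct Entry {
    buffer: Buffer,
}

/// Naive accounting, as in the quoted `insert`: every entry is charged
/// the full size of the buffer it references.
fn naive_size(entries: &[Entry]) -> usize {
    entries.iter().map(|e| e.buffer.len()).sum()
}

/// Deduplicated accounting: each distinct buffer is charged once,
/// keyed by its allocation address.
fn dedup_size(entries: &[Entry]) -> usize {
    let mut seen: HashMap<*const u8, usize> = HashMap::new();
    for e in entries {
        seen.insert(e.buffer.as_ptr(), e.buffer.len());
    }
    seen.values().sum()
}

fn main() {
    // One shared 1 MiB buffer (standing in for the ~4 GB batch).
    let big: Buffer = Arc::new(vec![0u8; 1 << 20]);
    // Three entries all reference the same buffer, like three TopK
    // inserts that each point back into the same upstream batch.
    let entries = vec![
        Entry { buffer: Arc::clone(&big) },
        Entry { buffer: Arc::clone(&big) },
        Entry { buffer: Arc::clone(&big) },
    ];
    println!("naive: {} bytes", naive_size(&entries)); // 3x the real usage
    println!("dedup: {} bytes", dedup_size(&entries)); // actual usage
}
```

This mirrors the trace above, where each insert adds another ~4 GB to `batches_size` even though the entries share the same upstream data.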
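On the force-compaction idea: the gist, as I understand it, is that when tracked memory crosses a limit, the rows still referenced by the heap are copied into a fresh, tightly sized buffer so the oversized source batch can be released. A rough standalone sketch (names and the fixed-width-row layout are illustrative, not DataFusion's actual API):

```rust
use std::sync::Arc;

/// Stand-in batch: a shared buffer plus the rows TopK still references.
struct Entry {
    buffer: Arc<Vec<u8>>,  // full upstream batch data
    used_rows: Vec<usize>, // byte offsets of rows the heap references
    row_len: usize,        // fixed row width, for simplicity
}

/// Copy only the referenced rows into a new, tightly sized buffer,
/// dropping the reference to the (possibly huge) source buffer.
fn compact(entry: &Entry) -> Entry {
    let mut out = Vec::with_capacity(entry.used_rows.len() * entry.row_len);
    for &off in &entry.used_rows {
        out.extend_from_slice(&entry.buffer[off..off + entry.row_len]);
    }
    Entry {
        used_rows: (0..entry.used_rows.len()).map(|i| i * entry.row_len).collect(),
        row_len: entry.row_len,
        buffer: Arc::new(out),
    }
}

fn main() {
    // A 1 MiB "upstream batch" of which TopK only references two 8-byte rows.
    let entry = Entry {
        buffer: Arc::new(vec![0u8; 1 << 20]),
        used_rows: vec![0, 512],
        row_len: 8,
    };
    let compacted = compact(&entry);
    println!(
        "before: {} bytes, after: {} bytes",
        entry.buffer.len(),
        compacted.buffer.len()
    ); // prints "before: 1048576 bytes, after: 16 bytes"
}
```

If TopK only keeps a handful of rows out of each multi-GB batch, compacting on limit pressure should collapse the tracked size back to roughly the size of the retained rows.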
