bharath-techie commented on issue #19216: URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3635622889
> Therefore, I don't think the issue is related to TopK itself, but rather the memory usage of one of the GroupByHash aggregates

I get this point: for each batch that gets inserted and referenced in TopK, the entire batch size gets added during size estimation (as mentioned in https://github.com/apache/datafusion/issues/9562). Here is the relevant code with my debug instrumentation added:

```rust
/// Insert a record batch entry into this store, tracking its
/// memory use, if it has any uses
pub fn insert(&mut self, entry: RecordBatchEntry) {
    // uses of 0 means that none of the rows in the batch were stored in the topk
    if entry.uses > 0 {
        let size = get_record_batch_memory_size(&entry.batch);
        self.batches_size += size;
        println!("size during insert : {}", size);
        self.batches.insert(entry.id, entry);
    }
}

/// returns the size of memory used by this store, including all
/// referenced `RecordBatch`es, in bytes
pub fn size(&self) -> usize {
    let size_of_self = size_of::<Self>();
    let capacity = self.batches.capacity();
    let per_entry = size_of::<u32>() + size_of::<RecordBatchEntry>();
    let batches_size = self.batches_size;
    let size = size_of_self + capacity * per_entry + batches_size;
    println!(
        "self size : {} , capacity : {} , heap size : {}, batch size : {}",
        size_of_self, capacity, per_entry, batches_size
    );
    println!("size during get : {}", size);
    size
}
```

For the above query, we have a ~4 GB record batch in `GroupByHashAggregate` that gets counted again for each record batch added to TopK:

```
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 4196909056
size during get : 4196909284
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 8393818112
size during get : 8393818340
size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 12590727168
size during get : 12590727396
```

Going through previous related issues:

- I think @Dandandan [mentioned](https://github.com/apache/datafusion/issues/9417#issuecomment-2431943283) forcing compaction when reaching the memory limit; should we try that?
- https://github.com/apache/datafusion/pull/15591 could help as well. Or do we have any newer issues that could help here?

Apologies if I'm polluting this issue with something unrelated to TopK; maybe we can discuss this over in https://github.com/apache/datafusion/issues/9417 or in a new issue.
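To make the overcounting concrete, here is a minimal standalone sketch (not DataFusion code; `Buffer` and `Entry` are stand-ins for an Arrow buffer and a `RecordBatchEntry`). Several TopK entries that reference the same underlying batch get charged its full size once per entry, whereas deduplicating by the buffer's allocation address charges it once:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a shared Arrow buffer: a refcounted byte region.
type Buffer = Arc<Vec<u8>>;

/// Stand-in for a `RecordBatchEntry`: a view over a (possibly shared) buffer.
struct Entry {
    buffer: Buffer,
}

/// Naive accounting, as in the quoted `insert`: every entry is charged
/// the full size of the buffer it references.
fn naive_size(entries: &[Entry]) -> usize {
    entries.iter().map(|e| e.buffer.len()).sum()
}

/// Deduplicated accounting: each distinct buffer is charged once,
/// keyed by its allocation address.
fn dedup_size(entries: &[Entry]) -> usize {
    let mut seen: HashMap<*const u8, usize> = HashMap::new();
    for e in entries {
        seen.insert(e.buffer.as_ptr(), e.buffer.len());
    }
    seen.values().sum()
}

fn main() {
    // One shared 1 MiB buffer (standing in for the ~4 GB batch).
    let big: Buffer = Arc::new(vec![0u8; 1 << 20]);
    // Three entries all reference the same buffer, like three TopK
    // inserts that each point back into the same upstream batch.
    let entries = vec![
        Entry { buffer: Arc::clone(&big) },
        Entry { buffer: Arc::clone(&big) },
        Entry { buffer: Arc::clone(&big) },
    ];
    println!("naive: {} bytes", naive_size(&entries)); // 3x the real usage
    println!("dedup: {} bytes", dedup_size(&entries)); // actual usage
}
```

This mirrors the trace above, where each insert adds another ~4 GB to `batches_size` even though the entries share the same upstream data.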
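On the force-compaction idea: the gist, as I understand it, is that when tracked memory crosses a limit, the rows still referenced by the heap are copied into a fresh, tightly sized buffer so the oversized source batch can be released. A rough standalone sketch (names and the fixed-width-row layout are illustrative, not DataFusion's actual API):

```rust
use std::sync::Arc;

/// Stand-in batch: a shared buffer plus the rows TopK still references.
struct Entry {
    buffer: Arc<Vec<u8>>,  // full upstream batch data
    used_rows: Vec<usize>, // byte offsets of rows the heap references
    row_len: usize,        // fixed row width, for simplicity
}

/// Copy only the referenced rows into a new, tightly sized buffer,
/// dropping the reference to the (possibly huge) source buffer.
fn compact(entry: &Entry) -> Entry {
    let mut out = Vec::with_capacity(entry.used_rows.len() * entry.row_len);
    for &off in &entry.used_rows {
        out.extend_from_slice(&entry.buffer[off..off + entry.row_len]);
    }
    Entry {
        used_rows: (0..entry.used_rows.len()).map(|i| i * entry.row_len).collect(),
        row_len: entry.row_len,
        buffer: Arc::new(out),
    }
}

fn main() {
    // A 1 MiB "upstream batch" of which TopK only references two 8-byte rows.
    let entry = Entry {
        buffer: Arc::new(vec![0u8; 1 << 20]),
        used_rows: vec![0, 512],
        row_len: 8,
    };
    let compacted = compact(&entry);
    println!(
        "before: {} bytes, after: {} bytes",
        entry.buffer.len(),
        compacted.buffer.len()
    ); // prints "before: 1048576 bytes, after: 16 bytes"
}
```

If TopK only keeps a handful of rows out of each multi-GB batch, compacting on limit pressure should collapse the tracked size back to roughly the size of the retained rows.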
