EeshanBembi opened a new pull request, #19444:
URL: https://github.com/apache/datafusion/pull/19444

   Add garbage collection for StringView and BinaryView arrays before spilling 
to disk. This prevents sliced arrays from carrying their entire original 
buffers when written to spill files.
   
   Changes:
   - Add gc_view_arrays() function to apply GC on view arrays
   - Integrate GC into InProgressSpillFile::append_batch()
   - Use a simple threshold-based heuristic (100+ rows, 10KB+ buffer size)
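
To illustrate why a sliced view array can drag its whole original buffer into a spill file, and what compaction changes, here is a minimal sketch in plain Rust. It is a toy model, not the arrow-rs API and not this PR's implementation: `ToyViewArray`, `should_gc`, `compact_before_spill`, and both threshold constants are hypothetical names. It also assumes that both thresholds must be met and that "10KB" means 10 * 1024 bytes; the description above does not say either way.

```rust
use std::sync::Arc;

// Hypothetical thresholds modeled on the "100+ rows, 10KB+ buffer" heuristic.
// Assumption: both conditions must hold, and 10KB means 10 * 1024 bytes.
const MIN_ROWS_FOR_GC: usize = 100;
const MIN_BUFFER_BYTES_FOR_GC: usize = 10 * 1024;

/// Toy stand-in for a view array: each view is an (offset, length) pair
/// pointing into one shared data buffer. Real Arrow views also inline
/// short values in the view itself; that detail is omitted here.
#[derive(Clone)]
struct ToyViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ToyViewArray {
    /// Build an array by appending every value to a single buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Zero-copy slice: only the views are narrowed, so the slice still
    /// references the entire original buffer.
    fn slice(&self, offset: usize, len: usize) -> Self {
        Self {
            buffer: Arc::clone(&self.buffer),
            views: self.views[offset..offset + len].to_vec(),
        }
    }

    /// Bytes of the value at row `i`.
    fn value(&self, i: usize) -> &[u8] {
        let (off, len) = self.views[i];
        &self.buffer[off..off + len]
    }

    /// Threshold check: skip compaction for small arrays, where the copy
    /// is unlikely to pay for itself.
    fn should_gc(&self) -> bool {
        self.views.len() >= MIN_ROWS_FOR_GC
            && self.buffer.len() >= MIN_BUFFER_BYTES_FOR_GC
    }

    /// Compaction: copy only the bytes that the views reference into a
    /// fresh buffer and rewrite the views to point at it.
    fn gc(&self) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for i in 0..self.views.len() {
            let bytes = self.value(i);
            views.push((buffer.len(), bytes.len()));
            buffer.extend_from_slice(bytes);
        }
        Self { buffer: Arc::new(buffer), views }
    }
}

/// Hypothetical pre-spill hook: compact when the heuristic says so,
/// otherwise pass the array through unchanged.
fn compact_before_spill(array: &ToyViewArray) -> ToyViewArray {
    if array.should_gc() {
        array.gc()
    } else {
        array.clone()
    }
}
```

In this model, a 100-row slice of a 1000-row array still references all of the original bytes until `compact_before_spill` copies the referenced values into a fresh buffer; a 10-row slice falls below the row threshold and is passed through untouched.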
   
   Fixes #19414, where GROUP BY on StringView columns created 820MB spill files 
instead of 33MB because sliced arrays kept references to their original buffers.
   
   Testing shows 80-98% reduction in spill file sizes for typical GROUP BY 
workloads.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

