EeshanBembi commented on issue #19414:
URL: https://github.com/apache/datafusion/issues/19414#issuecomment-3681930247

   Hey @bharath-techie 
   I've opened PR #19444 to address this issue. The fix adds garbage collection 
for StringView/BinaryView arrays before spilling to disk, which reduces spill 
file sizes by ~96% (820MB → 33MB) as reported.
   
     The implementation:
     - Performs GC on StringView/BinaryView columns in 
InProgressSpillFile::append_batch() before writing
     - Skips GC for small arrays (<10 rows) and when no buffers need 
compaction (the 10-row threshold is arbitrary and can be changed)
     - Adds comprehensive tests, including a specific reproduction of this 
ClickBench issue (which could be removed or modified)
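
To illustrate why compacting before the write shrinks the spill file, here is a minimal std-only sketch. It is a simplified model, not the arrow-rs types and not the code in the PR: `View`, `ViewArray`, `spill_bytes`, and `demo` are hypothetical names, and the model ignores Arrow's inlining of values of 12 bytes or fewer. It only shows that a sliced view array keeps the whole backing buffer alive, while a GC pass copies just the referenced bytes.

```rust
use std::rc::Rc;

/// One view: which buffer it points into and the byte range inside it.
/// (Hypothetical model type; real Arrow views are packed into 16 bytes.)
#[derive(Clone, Debug)]
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Simplified stand-in for a StringView/BinaryView column.
struct ViewArray {
    views: Vec<View>,
    buffers: Vec<Rc<Vec<u8>>>,
}

impl ViewArray {
    /// Bytes written if every attached buffer is spilled in full,
    /// plus 16 bytes per view.
    fn spill_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum::<usize>() + self.views.len() * 16
    }

    /// Keep rows [start, start + len) while sharing the original buffers.
    fn slice(&self, start: usize, len: usize) -> ViewArray {
        ViewArray {
            views: self.views[start..start + len].to_vec(),
            buffers: self.buffers.clone(),
        }
    }

    /// Copy only the referenced bytes into one fresh buffer.
    fn gc(&self) -> ViewArray {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            let src = &self.buffers[v.buffer][v.offset..v.offset + v.len];
            views.push(View { buffer: 0, offset: data.len(), len: v.len });
            data.extend_from_slice(src);
        }
        ViewArray { views, buffers: vec![Rc::new(data)] }
    }
}

/// Returns (bytes spilled without GC, bytes spilled after GC)
/// for a 10-row slice of a 1000-row column with 100-byte values.
fn demo() -> (usize, usize) {
    let buffer = Rc::new(vec![b'x'; 100_000]);
    let views = (0..1000)
        .map(|i| View { buffer: 0, offset: i * 100, len: 100 })
        .collect();
    let full = ViewArray { views, buffers: vec![buffer] };
    let small = full.slice(0, 10);
    (small.spill_bytes(), small.gc().spill_bytes())
}
```

In this model the 10-row slice would spill the full 100,000-byte buffer plus its views, whereas the compacted copy spills only the 1,000 referenced bytes plus the same views.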
   
     The approach aligns with @alamb's suggestion to GC during spill when the 
waste ratio is high. It currently uses a simple heuristic (data buffers present 
and the row-count threshold above), but this could be refined in follow-up PRs 
to use a more sophisticated waste-ratio calculation, similar to Arrow's 
BatchCoalescer.
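
As a rough idea of what a waste-ratio refinement could look like, the sketch below gates GC on both a row threshold and the fraction of buffer bytes that no view references. Everything here is an assumption for illustration: `GcPolicy`, `waste_ratio`, and `should_gc` are hypothetical names, the thresholds are arbitrary, and this is not how the PR or Arrow's BatchCoalescer is implemented.

```rust
/// Hypothetical policy knobs; the values a caller picks are illustrative only.
struct GcPolicy {
    /// Arrays with fewer rows than this are never compacted.
    min_rows: usize,
    /// Compact only when the wasted fraction exceeds this value (0.0 to 1.0).
    max_waste_ratio: f64,
}

/// Fraction of attached buffer bytes that no view references.
/// `referenced_bytes` may exceed `buffer_bytes` if views overlap,
/// so the subtraction saturates at zero.
fn waste_ratio(buffer_bytes: usize, referenced_bytes: usize) -> f64 {
    if buffer_bytes == 0 {
        return 0.0;
    }
    let wasted = buffer_bytes.saturating_sub(referenced_bytes);
    wasted as f64 / buffer_bytes as f64
}

/// Decide whether compacting before the spill write is likely worthwhile.
fn should_gc(
    policy: &GcPolicy,
    rows: usize,
    buffer_bytes: usize,
    referenced_bytes: usize,
) -> bool {
    rows >= policy.min_rows
        && waste_ratio(buffer_bytes, referenced_bytes) > policy.max_waste_ratio
}
```

With `min_rows = 10` and `max_waste_ratio = 0.5`, a 10-row array referencing 1,000 of 100,000 buffer bytes would be compacted, while one referencing 900 of 1,000 bytes would be written as-is.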


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

