Re: [I] External sorting not working for (maybe only for string columns??) [datafusion]

via GitHub Thu, 13 Feb 2025 04:20:31 -0800


alamb commented on issue #12136:
URL: https://github.com/apache/datafusion/issues/12136#issuecomment-2656430639


   > I have also encountered the same problem with string views.
   > 
   > DataFusion uses the `interleave` function to produce merged batches, and 
`interleave` tends to produce batches that report a super large size due to 
[apache/arrow-rs#6779](https://github.com/apache/arrow-rs/pull/6779). The 
result simply references the data buffers of the interleaved arrays, so it does 
not actually take extra memory space, but it makes the result of 
`get_record_batch_memory_size(batch)` or `batch.get_array_memory_size()` super 
large, increasing the chance of memory reservation failures.
   > 
   > When spilling happens, these interleaved arrays are serialized using 
Arrow IPC, which produces very large binaries. When we read them back in the 
spill-read phase, we have to allocate super large buffers for these arrays, 
which makes things much worse.
   
   I think the fix in https://github.com/apache/arrow-rs/pull/6779 is included 
in DataFusion 45 -- does this still happen?
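
To make the accounting effect in the quoted comment concrete, here is a minimal 
sketch in plain Rust. It is a toy model, not the arrow-rs implementation: 
`ViewArray`, `reported_size`, `logical_size`, and this `interleave` are 
hypothetical names invented for illustration. The assumption is that an 
interleaved result keeps a reference to every source buffer instead of copying 
the selected bytes, so a size calculation that counts whole buffers reports far 
more than the rows actually use.

```rust
use std::rc::Rc;

/// Toy stand-in for a string-view array (hypothetical, not the arrow-rs type).
/// Each row is a (buffer index, offset, length) view into a shared data buffer.
struct ViewArray {
    buffers: Vec<Rc<Vec<u8>>>,
    views: Vec<(usize, usize, usize)>,
}

impl ViewArray {
    /// Pack all values into a single data buffer and record one view per value.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::new();
        for v in values {
            views.push((0, data.len(), v.len()));
            data.extend_from_slice(v.as_bytes());
        }
        ViewArray { buffers: vec![Rc::new(data)], views }
    }

    fn value(&self, row: usize) -> &str {
        let (b, off, len) = self.views[row];
        std::str::from_utf8(&self.buffers[b].as_slice()[off..off + len]).unwrap()
    }

    /// Counts every referenced buffer in full, whether or not a view uses it.
    fn reported_size(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }

    /// Counts only the bytes the views actually address.
    fn logical_size(&self) -> usize {
        self.views.iter().map(|v| v.2).sum()
    }
}

/// Pick rows `(source index, row index)` without copying string bytes:
/// the result shares (reference-counts) all buffers of all sources.
fn interleave(sources: &[&ViewArray], indices: &[(usize, usize)]) -> ViewArray {
    let mut buffers = Vec::new();
    let mut base = Vec::new(); // position of each source's first buffer
    for s in sources {
        base.push(buffers.len());
        buffers.extend(s.buffers.iter().cloned());
    }
    let views = indices
        .iter()
        .map(|&(src, row)| {
            let (b, off, len) = sources[src].views[row];
            (base[src] + b, off, len)
        })
        .collect();
    ViewArray { buffers, views }
}

fn main() {
    let a = ViewArray::from_strs(&["alpha-row-0", "alpha-row-1", "alpha-row-2"]);
    let b = ViewArray::from_strs(&["beta-row-0", "beta-row-1"]);
    // Select one row from each source.
    let merged = interleave(&[&a, &b], &[(0, 1), (1, 0)]);
    println!("rows: {} / {}", merged.value(0), merged.value(1));
    println!("logical bytes:  {}", merged.logical_size());
    println!("reported bytes: {}", merged.reported_size());
}
```

In this model the gap between the two numbers grows with the size of the source 
buffers rather than with the number of selected rows, which matches the 
behaviour described above: no extra memory is held, yet reservations based on 
the reported size can fail, and writing out every referenced buffer in full 
would inflate spill files.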


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

