alamb commented on issue #21543:
URL: https://github.com/apache/datafusion/issues/21543#issuecomment-4237525173

   > Incremental Rows encoding. RowConverter::append already supports 
incrementally extending a Rows buffer across batches. ExternalSorter could 
maintain a Rows alongside its in_mem_batches, calling append as each batch 
arrives (similar to DuckDB's approach). At sort time, the encoding is already 
done — you just sort the accumulated Rows and use the indices to reorder the 
original batches. The tradeoff is higher memory during accumulation (raw 
batches + encoded rows), but encoding cost is fully amortized and radix sort 
gets a large contiguous run to work with.
   
   I think this is a good idea to pursue as well -- I also wonder whether, once we 
have already encoded the data in the row format, we could avoid the second copy 
entirely, perhaps by keeping a list of sorted indices and merging with those 
rather than copying the data again 🤔 
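
To make the idea concrete, here is a minimal std-only sketch of the flow discussed above: encode sort keys as each batch arrives, sort indices over the accumulated encoded rows, then merge runs by index instead of copying row bytes. Everything in it is an assumption for illustration: `Accumulator`, `encode_key`, `sort_batches`, and `merge_indices` are hypothetical names, the key is a single non-null `i64` column, and the fixed 8-byte encoding only stands in for the arrow row format. It does not use `RowConverter` or `ExternalSorter`.

```rust
/// Hypothetical order-preserving encoding for one i64 key: flip the sign bit
/// and store big-endian, so byte-wise comparison matches numeric order.
/// (A simplification; the real row format also handles nulls and sort options.)
fn encode_key(v: i64) -> [u8; 8] {
    ((v as u64) ^ (1u64 << 63)).to_be_bytes()
}

/// Hypothetical stand-in for a sorter that keeps raw batches and encoded rows
/// side by side while input is still arriving.
struct Accumulator {
    batches: Vec<Vec<i64>>,      // raw batches (stand-in for in_mem_batches)
    rows: Vec<[u8; 8]>,          // encoded keys, extended once per batch
    origin: Vec<(usize, usize)>, // (batch index, row index) of each encoded row
}

impl Accumulator {
    fn new() -> Self {
        Self { batches: Vec::new(), rows: Vec::new(), origin: Vec::new() }
    }

    /// Encode the batch's keys on arrival, so no encoding is left for sort time.
    fn push_batch(&mut self, batch: Vec<i64>) {
        let batch_idx = self.batches.len();
        for (row_idx, v) in batch.iter().enumerate() {
            self.rows.push(encode_key(*v));
            self.origin.push((batch_idx, row_idx));
        }
        self.batches.push(batch);
    }

    /// Sort indices by comparing encoded rows; the row bytes are never moved.
    fn sorted_indices(&self) -> Vec<usize> {
        let mut idx: Vec<usize> = (0..self.rows.len()).collect();
        idx.sort_by(|&a, &b| self.rows[a].cmp(&self.rows[b]));
        idx
    }

    /// Use the sorted indices to pull values out of the original batches.
    fn gather(&self, indices: &[usize]) -> Vec<i64> {
        indices
            .iter()
            .map(|&i| {
                let (b, r) = self.origin[i];
                self.batches[b][r]
            })
            .collect()
    }
}

/// Accumulate all batches, then sort once over the full encoded run.
fn sort_batches(batches: Vec<Vec<i64>>) -> Vec<i64> {
    let mut acc = Accumulator::new();
    for b in batches {
        acc.push_batch(b);
    }
    let idx = acc.sorted_indices();
    acc.gather(&idx)
}

/// Merge two runs of already-sorted indices by comparing the encoded rows they
/// point at. Only indices are written to the output, not row data.
fn merge_indices(rows: &[[u8; 8]], a: &[usize], b: &[usize]) -> Vec<usize> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if rows[a[i]] <= rows[b[j]] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

In this sketch the memory tradeoff from the quoted proposal is visible directly: `batches` and `rows` are both held until the sort runs. Whether index-based merging is actually cheaper than copying rows would depend on access patterns and would need measuring.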


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

