alamb commented on issue #21543: URL: https://github.com/apache/datafusion/issues/21543#issuecomment-4237525173
> Incremental Rows encoding. RowConverter::append already supports incrementally extending a Rows buffer across batches. ExternalSorter could maintain a Rows alongside its in_mem_batches, calling append as each batch arrives (similar to DuckDB's approach). At sort time, the encoding is already done: you just sort the accumulated Rows and use the indices to reorder the original batches. The tradeoff is higher memory during accumulation (raw batches + encoded rows), but encoding cost is fully amortized and radix sort gets a large contiguous run to work with.

I think this is a good idea to pursue as well. I also wonder whether, if we have already created the data in the row format, we could avoid the second copy entirely, perhaps by keeping a list of sorted indices and then merging using those rather than copying the data again 🤔
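To make the idea concrete, here is a minimal std-only sketch of "encode incrementally, then sort indices". It does not use the arrow-row API: `EncodedRows`, `append`, `sorted_origins`, and `sort_batches` are hypothetical names invented for illustration, the sort key is assumed to be a single `i32` column, and a comparison sort stands in for radix sort. It only shows the shape of the approach, not how ExternalSorter would actually implement it.

```rust
/// Toy stand-in for a row-encoded buffer (NOT the arrow-row `Rows` type).
/// Each row is stored as a byte-comparable key plus a pointer back to its source.
struct EncodedRows {
    keys: Vec<[u8; 4]>,          // order-preserving encoding of the i32 sort key
    origin: Vec<(usize, usize)>, // (batch index, row index within that batch)
}

impl EncodedRows {
    fn new() -> Self {
        EncodedRows { keys: Vec::new(), origin: Vec::new() }
    }

    /// Flip the sign bit and store big-endian, so that byte-wise comparison
    /// of the encoded keys matches numeric order of the original i32 values.
    fn encode(v: i32) -> [u8; 4] {
        ((v as u32) ^ 0x8000_0000).to_be_bytes()
    }

    /// Extend the buffer with one more batch; meant to be called as each batch arrives.
    fn append(&mut self, batch_idx: usize, batch: &[i32]) {
        for (row_idx, &v) in batch.iter().enumerate() {
            self.keys.push(Self::encode(v));
            self.origin.push((batch_idx, row_idx));
        }
    }

    /// Sort row indices by encoded key and return the source location of each row
    /// in sorted order. No row data is copied here, only indices.
    fn sorted_origins(&self) -> Vec<(usize, usize)> {
        let mut idx: Vec<usize> = (0..self.keys.len()).collect();
        idx.sort_by(|&a, &b| self.keys[a].cmp(&self.keys[b]));
        idx.into_iter().map(|i| self.origin[i]).collect()
    }
}

/// Accumulate encoded rows across all batches, then reorder the original
/// values using the sorted indices.
fn sort_batches(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut rows = EncodedRows::new();
    for (i, batch) in batches.iter().enumerate() {
        rows.append(i, batch); // encoding cost is paid at arrival time
    }
    rows.sorted_origins()
        .into_iter()
        .map(|(b, r)| batches[b][r])
        .collect()
}

fn main() {
    let batches = vec![vec![3, -1, 7], vec![0, 5], vec![-4, 2]];
    println!("{:?}", sort_batches(&batches));
}
```

In this sketch the sorted `(batch, row)` pairs are the only thing produced by the sort step, which is roughly the "keep a list of sorted indices" idea: the original batches are read once at the end instead of being copied into a second sorted buffer first.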
