alamb commented on issue #2148:
URL: 
https://github.com/apache/arrow-datafusion/issues/2148#issuecomment-1089332433

   Thoughts on N-way merging:
   
   We (Influxdb IOx in general and myself in particular) are very interested in 
this as well because the `SortPreservingMerge` is one of the key bottlenecks we 
see when sorting out data. 
   
   Here is what I was thinking about how to proceed:
   1. Create a benchmark for merging (including multi-column keys and variable 
length (Utf8) keys)
   2. Spike out some tests
   
   Areas for investigation / things to spike out:
   1. Use row-format sort key (similar to what @yjshen  has done in 
https://github.com/apache/arrow-datafusion/pull/2146) so that the comparisons 
are done by comparing `[u8]` rather than array access
   1.  Use "Cascade Merge" rather than N-Way merge, as hinted at in the DuckDB 
blog: https://duckdb.org/2021/08/27/external-sorting.html
   3. Figure out how to parallelize both the merge (the DuckDB blog has some 
hints) as well as the creation of RecordBatches from the inputs. I have thought 
about this but need more time to think through how it would work.
   
   cc @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to