alamb commented on issue #2148: URL: https://github.com/apache/arrow-datafusion/issues/2148#issuecomment-1089332433
Thoughts on N-way merging: We (Influxdb IOx in general and myself in particular) are very interested in this as well because the `SortPreservingMerge` is one of the key bottlenecks we see when sorting out data. Here is what I was thinking about how to proceed: 1. Create a benchmark for merging (including multi-column keys and variable length (Utf8) keys) 2. Spike out some tests Areas for investigation / things to spike out: 1. Use row-format sort key (similar to what @yjshen has done in https://github.com/apache/arrow-datafusion/pull/2146) so that the comparisons are done by comparing `[u8]` rather than array access 1. Use "Cascade Merge" rather than N-Way merge, as hinted at in the DuckDB blog: https://duckdb.org/2021/08/27/external-sorting.html 3. Figure out how to parallelize both the merge (the DuckDB blog has some hints) as well as the creation of RecordBatches from the inputs. I have thought about this but need more time to think through how it would work. cc @tustvold -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
