wiedld opened a new pull request, #7379: URL: https://github.com/apache/arrow-datafusion/pull/7379
**WIP: have a few optimizations todo, including those noted in this code.** ## Which issue does this PR close? External sorting (cascading merges) of the internal-sorted (in-memory) SortPreservingMergeStream. Closes https://github.com/apache/arrow-datafusion/issues/7181 ## Rationale for this change Have a cascaded merge with each fan-in up to 10 streams. Potential performance improvements. Benchmarking of current WIP (prior to further improvements): <img width="640" alt="Screen Shot 2023-08-22 at 3 15 29 PM" src="https://github.com/apache/arrow-datafusion/assets/10232835/5e3dec14-363b-46e0-a168-3e2fd59654b5"> ## What changes are included in this PR? How it works: * Each merge uses the same code (in-memory sorting using the sort-preserving-merge loser tree). * The sort_order from previous merges, are yielded/streamed into the next merge in the cascade tree. * The cascade tree root (final merge) node does the construction and yielding of record batches. Performance considerations: * Focused on not adding too much code to the inner loop of the merge. * Each merge is only sorting the normalized keys (accessed via the cursors). * The distribution function (a.k.a. how distribute streams between merges) is order preserving. * The loser tree (min heap), at each merge, has at least 1 pointer each from input stream. * Merge stream outputs, are inputs to the next merge stream. * Therefore, any downstream (cascaded) merges are also ordered and cannot advance beyond each other. * TODO: * have cycles of slicing, followed by concating, of rows at each merge yield. * Right now, it only slices. Which parses up our row cursors to be very small batches. * consider ways to remove mutex * mutex is because we require a single RowConverter for normalized key production. This slows down the leaf-nodes (which pull from RowCursorStream). * if we keep the mutex, then we could move the RecordBatch to a single store point (avoid slicing on merge yields) with the same approx overhead cost (on RowCursorStream polling). ## Are these changes tested? Current tests are passing, if and only if, the corresponding change in arrow-rs is also linked. How to run: 1. Pull [this branch](https://github.com/wiedld/arrow-rs/tree/datafusion-7181/enable-row-slicing) for arrow-rs changes. 2. Path dependencies for arrow, by adding the below to your workspace Cargo.toml for arrow-datafusion. ``` [patch.crates-io] arrow = { path = "/Users/whatever/arrow-rs/arrow" } arrow-array = { path = "/Users/whatever/arrow-rs/arrow-array" } arrow-buffer = { path = "/Users/whatever/arrow-rs/arrow-buffer" } arrow-flight = { path = "/Users/whatever/arrow-rs/arrow-flight" } arrow-ord = { path= "/Users/whatever/arrow-rs/arrow-ord" } arrow-schema = { path = "/Users/whatever/arrow-rs/arrow-schema" } arrow-select = { path= "/Users/whatever/arrow-rs/arrow-select" } arrow-string = { path= "/Users/whatever/arrow-rs/arrow-string" } parquet = { path= "/Users/whatever/arrow-rs/parquet" } ``` 3. Main branch is already failing this test. Ignore: * (datasource::physical_plan::tests::schema_adapter_map_schema_with_projection) fails when using latest arrow main. * Ignored this test for the rest of dev work. 4. run tests per usual. ## Are there any user-facing changes? Changes are to internal APIs only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
