[GitHub] [arrow-datafusion] wiedld opened a new pull request, #7379: feat(7181): cascading loser tree merges

via GitHub Tue, 22 Aug 2023 15:43:18 -0700


wiedld opened a new pull request, #7379:
URL: https://github.com/apache/arrow-datafusion/pull/7379


   **WIP: have a few optimizations todo, including those noted in this code.**
   
   
   ## Which issue does this PR close?
   External sorting (cascading merges) of the internal-sorted (in-memory) 
SortPreservingMergeStream.
   
   Closes https://github.com/apache/arrow-datafusion/issues/7181
   
   
   ## Rationale for this change
   Have a cascaded merge with each fan-in up to 10 streams.
   Potential performance improvements.
   
   Benchmarking of current WIP (prior to further improvements):
   <img width="640" alt="Screen Shot 2023-08-22 at 3 15 29 PM" 
src="https://github.com/apache/arrow-datafusion/assets/10232835/5e3dec14-363b-46e0-a168-3e2fd59654b5";>
   
   
   ## What changes are included in this PR?
   
   How it works:
   * Each merge uses the same code (in-memory sorting using the 
sort-preserving-merge loser tree).
   * The sort_order from previous merges, are yielded/streamed into the next 
merge in the cascade tree.
   * The cascade tree root (final merge) node does the construction and 
yielding of record batches.
   
   Performance considerations:
   * Focused on not adding too much code to the inner loop of the merge.
   * Each merge is only sorting the normalized keys (accessed via the cursors). 
   * The distribution function (a.k.a. how distribute streams between merges) 
is order preserving.
       * The loser tree (min heap), at each merge, has at least 1 pointer each 
from input stream.
       * Merge stream outputs, are inputs to the next merge stream.
       * Therefore, any downstream (cascaded) merges are also ordered and 
cannot advance beyond each other.
   * TODO:
       * have cycles of slicing, followed by concating, of rows at each merge 
yield.
          * Right now, it only slices. Which parses up our row cursors to be 
very small batches.
       * consider ways to remove mutex
          * mutex is because we require a single RowConverter for normalized 
key production. This slows down the leaf-nodes (which pull from 
RowCursorStream).
          * if we keep the mutex, then we could move the RecordBatch to a 
single store point (avoid slicing on merge yields) with the same approx 
overhead cost (on RowCursorStream polling).
   
   
   
   ## Are these changes tested?
   
   Current tests are passing, if and only if, the corresponding change in 
arrow-rs is also linked.
   
   How to run:
   1. Pull [this 
branch](https://github.com/wiedld/arrow-rs/tree/datafusion-7181/enable-row-slicing)
 for arrow-rs changes.
   
   2. Path dependencies for arrow, by adding the below to your workspace 
Cargo.toml for arrow-datafusion.
       ```
       [patch.crates-io]
       arrow = { path = "/Users/whatever/arrow-rs/arrow" }
       arrow-array = { path = "/Users/whatever/arrow-rs/arrow-array" }
       arrow-buffer = { path = "/Users/whatever/arrow-rs/arrow-buffer" }
       arrow-flight = { path = "/Users/whatever/arrow-rs/arrow-flight" }
       arrow-ord = { path= "/Users/whatever/arrow-rs/arrow-ord" }
       arrow-schema = { path = "/Users/whatever/arrow-rs/arrow-schema" }
       arrow-select = { path= "/Users/whatever/arrow-rs/arrow-select" }
       arrow-string = { path= "/Users/whatever/arrow-rs/arrow-string" }
       parquet = { path= "/Users/whatever/arrow-rs/parquet" }
       ```
   
   3. Main branch is already failing this test. Ignore:
      * 
(datasource::physical_plan::tests::schema_adapter_map_schema_with_projection) 
fails when using latest arrow main.       
      * Ignored this test for the rest of dev work.
   
   4. run tests per usual.
   
   
   ## Are there any user-facing changes?
   
   Changes are to internal APIs only.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] wiedld opened a new pull request, #7379: feat(7181): cascading loser tree merges

Reply via email to