JerAguilon opened a new issue, #38381:
URL: https://github.com/apache/arrow/issues/38381

   ### Describe the enhancement requested
   
   I have a use case where I want to as-of join two datasets. However, each 
side of the join (left and right) is sharded across N files. Each file is 
individually sorted by `ts`, but the files' `ts` ranges can overlap, so simply 
concatenating them does not give a sorted dataset. Imagine a dataset composed of 
two files:
   
   file A
   ```
   ts,col
   1,"foo"
   3,"foo"
   5,"foo"
   ```
   
   file B
   ```
   ts,col
   2,"bar"
   3,"bar"
   6,"bar"
   ```
   
   After merging
   ```
   ts,col
   1,"foo"
   2,"bar"
   3,"bar"
   3,"foo"
   5,"foo"
   6,"bar"
   ```
   
   Sorting with `order_by` isn't tenable for huge datasets. Merging N sorted 
files is computationally similar to what `asof_join_node.cc` does: both can be 
done efficiently by streaming each input from top to bottom. 
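   
   As a rough illustration of the streaming merge described above (not the code 
in the draft PR), here is a minimal k-way merge sketch in plain C++. `Row`, 
`Cursor`, and `MergeSorted` are hypothetical names, in-memory vectors stand in 
for files that would really be read incrementally, and breaking `ts` ties by 
`col` is an assumption made only to reproduce the example output:
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <queue>
   #include <string>
   #include <tuple>
   #include <vector>

   // Hypothetical row type mirroring the ts,col example; not an Arrow type.
   struct Row {
     int64_t ts;
     std::string col;
   };

   // Position of the current head row within one sorted input.
   struct Cursor {
     size_t input;  // which sorted input this cursor reads from
     size_t pos;    // index of that input's current head row
   };

   // Merge inputs that are each already sorted by ts into one sorted output.
   // The heap holds one cursor per input, so the merge state is O(N) for N
   // inputs regardless of how many rows each input contains.
   std::vector<Row> MergeSorted(const std::vector<std::vector<Row>>& inputs) {
     // Min-heap ordering: smallest ts first. Ties are broken by col, which is
     // an assumption; the issue does not specify an order for equal ts.
     auto later = [&inputs](const Cursor& a, const Cursor& b) {
       const Row& ra = inputs[a.input][a.pos];
       const Row& rb = inputs[b.input][b.pos];
       return std::tie(ra.ts, ra.col) > std::tie(rb.ts, rb.col);
     };
     std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);

     // Seed the heap with the first row of every non-empty input.
     for (size_t i = 0; i < inputs.size(); ++i) {
       if (!inputs[i].empty()) heap.push({i, 0});
     }

     std::vector<Row> merged;
     while (!heap.empty()) {
       Cursor c = heap.top();
       heap.pop();
       merged.push_back(inputs[c.input][c.pos]);
       // Advance this input; it re-enters the heap only if rows remain.
       if (c.pos + 1 < inputs[c.input].size()) heap.push({c.input, c.pos + 1});
     }
     return merged;
   }
   ```
   
   Calling `MergeSorted` on file A and file B from the example yields the 
"After merging" table shown earlier.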
   
   I propose refactoring some of the internals of `asof_join_node.cc` so that 
they can also be used to perform the merge shown above. I think this would 
unlock capabilities that are currently found mainly in specialized databases 
like KDB.
   
   I have a draft PR for the idea here: 
https://github.com/apache/arrow/pull/38380. I have tested it locally.
   
   Would be curious to get opinions from @icexelloss, @bkietz, and @westonpace 
on this approach!
   
   ### Component(s)
   
   C++

