JerAguilon opened a new issue, #38381: URL: https://github.com/apache/arrow/issues/38381
### Describe the enhancement requested I have a use case wherein I want to asof join two datasets. However, each dataset on the left and right hand side is sharded across N files. Each file is individually sorted from top to bottom. Imagine a dataset composed of two files: file A ``` ts,col 1,"foo" 3,"foo" 5,"foo" ``` file B ``` ts,col 2,"bar" 3,"bar" 6,"bar" ``` After merging ``` ts,col 1,"foo" 2,"bar" 3,"bar" 3,"foo" 5,"foo" 6,"bar" ``` Sorting using `order_by` isn't tenable for huge datasets. Merging N files is actually computationally similar to `asof_join_node.cc` in that you can efficiently do it by streaming data from top to bottom. I propose refactoring some of the guts of `asof_join_node.cc` so that we can achieve the above computation. I think that this will unlock lots of potential that is hidden behind specialized databases like KDB. I have a draft PR for the idea here: https://github.com/apache/arrow/pull/38380 and have locally tested it Would be curious to get opinions from @icexelloss, @bkietz, and @westonpace on this approach! ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
