JerAguilon opened a new issue, #41873: URL: https://github.com/apache/arrow/issues/41873
### Describe the enhancement requested The asof-join has a big inefficiency when emitting columns from the left hand side of the join. Let me explain it visually, using a simple example from [pandas.merge_asof](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html), the equivalent to Arrow's asof-join. Pandas is of course a completely separate library, but this example is small and illustrative. ``` >>> left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]}) >>> left a left_val 0 1 a 1 5 b 2 10 c >>> right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]}) >>> right a right_val 0 1 1 1 2 2 2 3 3 3 6 6 4 7 7 >>> pd.merge_asof(left, right, on="a") a left_val right_val 0 1 a 1 1 5 b 3 2 10 c 7 ``` Notice how * the output has the same # of rows as the left hand table. * more than that, the columns stemming from the LHS are simply contiguous slices of the LHS input table In Arrow's implementation, we emit output arrays by copying the data ([code pointer](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/unmaterialized_table.h#L208 )) cell-by-cell. This is necessary for the right hand side, where you may reuse or drop rows entirely. However, for the left hand side, it'd be much faster _and_ lower memory to just make zero copy `Array::Slice`s ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
