JerAguilon opened a new issue, #41873:
URL: https://github.com/apache/arrow/issues/41873

   ### Describe the enhancement requested
   
   The asof-join is needlessly inefficient when emitting columns from the left-hand side of the join. To illustrate, here is a simple example from [pandas.merge_asof](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html), the pandas equivalent of Arrow's asof-join. pandas is of course a completely separate library, but this example is small and illustrative.
   
   ```
   >>> left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
   >>> left
       a left_val
   0   1        a
   1   5        b
   2  10        c
   >>> right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
   >>> right
      a  right_val
   0  1          1
   1  2          2
   2  3          3
   3  6          6
   4  7          7
   >>> pd.merge_asof(left, right, on="a")
       a left_val  right_val
   0   1        a          1
   1   5        b          3
   2  10        c          7
   ```
   
   Notice that:
   
   * the output has the same number of rows as the left-hand table
   * beyond that, the columns coming from the LHS are simply contiguous slices of the LHS input table
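
Both observations can be checked without pandas. The following is a minimal pure-Python sketch of a backward as-of join on sorted keys; `asof_join` is a hypothetical helper written for illustration, not Arrow's or pandas' implementation.

```python
from bisect import bisect_right

def asof_join(left_keys, left_vals, right_keys, right_vals):
    """Backward as-of join on sorted keys (illustrative helper only).

    For each left row, pick the last right row whose key is <= the left key,
    or None when no such row exists.
    """
    out = []
    for key, val in zip(left_keys, left_vals):
        i = bisect_right(right_keys, key)  # count of right keys <= key
        match = right_vals[i - 1] if i > 0 else None
        out.append((key, val, match))
    return out

left_keys, left_vals = [1, 5, 10], ["a", "b", "c"]
right_keys, right_vals = [1, 2, 3, 6, 7], [1, 2, 3, 6, 7]

rows = asof_join(left_keys, left_vals, right_keys, right_vals)
print(rows)  # [(1, 'a', 1), (5, 'b', 3), (10, 'c', 7)]

# The left-hand columns pass through untouched, row for row.
assert [r[0] for r in rows] == left_keys
assert [r[1] for r in rows] == left_vals
```

Only the right-hand column needs a per-row lookup. Right row 3 is picked for left key 5, while right rows 1, 2 and 6 never appear in the output.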
   
   In Arrow's implementation, we emit output arrays by copying the data cell by cell ([code pointer](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/unmaterialized_table.h#L208)). This is necessary for the right-hand side, where a row may be reused or dropped entirely.
   
   However, for the left-hand side, it would be much faster _and_ use less memory to emit zero-copy `Array::Slice`s of the input instead.
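
To make the difference concrete, here is a rough analogy using only Python's standard library. It is not Arrow code: an `array.array` stands in for a left-hand input column, and a sliced `memoryview` stands in for what `Array::Slice` does in Arrow C++ (return a new array that shares the parent's buffers). The mutation is there only to reveal which result shares memory; Arrow arrays are immutable.

```python
from array import array

# Stand-in for one left-hand-side input column (int64 values).
column = array("q", [1, 5, 10, 20, 40])

# Cell-by-cell materialization: allocates a new buffer and copies each value.
copied = array("q")
for i in range(1, 4):
    copied.append(column[i])

# Zero-copy slice: a view over rows 1..3 of the same underlying buffer.
view = memoryview(column)[1:4]

# Mutate the source to show which result shares memory with it.
column[2] = 99

print(list(copied))  # [5, 10, 20]  (independent copy, unaffected)
print(list(view))    # [5, 99, 20]  (shares the source buffer)
```

The copy costs time and memory proportional to the number of rows emitted, whereas the view only records an offset and a length.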
   
   
   ### Component(s)
   
   C++


