bepec opened a new issue, #41706:
URL: https://github.com/apache/arrow/issues/41706
### Describe the usage question you have. Please include as many useful details as possible.
With pyarrow 16.0.0, `join_asof` fails even though both input tables are
sorted by the "on" key.
I noticed this when joining larger sorted tables: for example, it fails
for tables with 1061753 and 994046 rows, but succeeds if I reduce them
to 1048178 and 975257 rows.
I think this behavior can be reproduced with the example below:
```
import numpy as np
import pyarrow as pa
ts0 = 0
nticks = 2_000_000 # it's OK for nticks = 1_000_000
ncats = 10
ticks = np.arange(ts0, ts0 + nticks)
cats = np.arange(0, ncats).repeat(nticks // ncats) # integer repeat count
t1 = pa.Table.from_pydict({"ts": ticks, "cats": cats})
t2 = pa.Table.from_pydict({"ts": ticks, "cats": cats})
t1.join_asof(t2, on="ts", tolerance=-10, by="cats")
# Last line fails with error:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[273], line 10
8 t1 = pa.Table.from_pydict({"ts": ticks, "cats": cats})
9 t2 = pa.Table.from_pydict({"ts": ticks, "cats": cats})
---> 10 t1.join_asof(t2, on="ts", tolerance=-10, by="cats")
File /lib/python3.10/site-packages/pyarrow/table.pxi:5528, in pyarrow.lib.Table.join_asof()
File /lib/python3.10/site-packages/pyarrow/acero.py:333, in _perform_join_asof(left_operand, left_on, left_by, right_operand, right_on, right_by, tolerance, use_threads, output_type)
326 join_opts = AsofJoinNodeOptions(
327 left_on, left_by, right_on, right_by, tolerance
328 )
329 decl = Declaration(
330 "asofjoin", options=join_opts, inputs=[left_source, right_source]
331 )
--> 333 result_table = decl.to_table(use_threads=use_threads)
335 if output_type == Table:
336 return result_table
File /lib/python3.10/site-packages/pyarrow/_acero.pyx:590, in pyarrow._acero.Declaration.to_table()
File /lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: AsofJoin does not allow out-of-order on-key values
```
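The error message blames the ordering of the on-key column, so it is worth ruling that out directly. Below is a minimal sketch of such a check in plain Python. `is_non_decreasing` is a hypothetical helper name, not part of pyarrow, and the `range` is only a stand-in for the `ts` column from the example above; with real tables, something like `t1["ts"].to_pylist()` could be passed instead.

```python
def is_non_decreasing(values):
    """Return True if every element is >= the element before it."""
    it = iter(values)
    try:
        prev = next(it)
    except StopIteration:
        return True  # an empty sequence is trivially ordered
    for cur in it:
        if cur < prev:
            return False
        prev = cur
    return True


# Stand-in for the "ts" column: same values as np.arange(0, 2_000_000).
ticks = range(0, 2_000_000)
print(is_non_decreasing(ticks))  # prints True
```

If this returns `True` for both inputs, the on-key values really are ordered, and the error has to come from somewhere other than the data itself.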
So I suspect the issue has nothing to do with the order of the on-key
values, but rather with the input size.
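To make the size hypothesis concrete, here is a small arithmetic sketch over the row counts reported above. It only checks which cases have an input larger than 2**20 = 1,048,576 rows. Whether that boundary actually matters to pyarrow is an assumption, not something the traceback confirms.

```python
BOUNDARY = 2 ** 20  # 1_048_576; assumed to be relevant, not confirmed

# Row counts taken from the report: (rows_a, rows_b, outcome)
cases = [
    (1_061_753, 994_046, "fails"),
    (1_048_178, 975_257, "works"),
    (2_000_000, 2_000_000, "fails"),
    (1_000_000, 1_000_000, "works"),
]


def exceeds_boundary(rows_a, rows_b, boundary=BOUNDARY):
    """True if either input has more rows than the boundary."""
    return max(rows_a, rows_b) > boundary


for rows_a, rows_b, outcome in cases:
    print(rows_a, rows_b, outcome, exceeds_boundary(rows_a, rows_b))
```

In all four cases the join fails exactly when one input exceeds 2**20 rows. That is consistent with a size-related cause but does not prove one.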
Is this a bug that can be fixed, or a fundamental limitation?
Is there any workaround other than limiting the input size?
### Component(s)
Python