0x26res commented on issue #41706:
URL: https://github.com/apache/arrow/issues/41706#issuecomment-2127377220
I have a similar issue with a smaller table.
It only happens if I have a lot of small chunks in the table.
Here's an example:
```
import pyarrow as pa
import pytest
from pandas import Timestamp
LEFT = [
    {"left_on": Timestamp("2023-09-07 12:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 12:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 12:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 12:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 13:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 13:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 13:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 13:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 14:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 14:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 14:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 14:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 15:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 15:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 15:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 15:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 16:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 16:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 16:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 16:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 17:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 17:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 17:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 17:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 18:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 18:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 18:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 18:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 19:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 19:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 19:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 19:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 20:00:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 20:15:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 20:30:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 20:45:00+0000", tz="UTC"), "left_by": "SYM1"},
    {"left_on": Timestamp("2023-09-07 21:00:00+0000", tz="UTC"), "left_by": "SYM1"},
]
RIGHT = [
    {
        "right_on": Timestamp("2023-09-07 15:00:00+0000", tz="UTC"),
        "right_by": "SYM1",
    }
]


def test_asofjoin_order():
    left: pa.Table = pa.Table.from_pylist(LEFT)
    right = pa.Table.from_pylist(RIGHT)
    # Split the left table into one single-row chunk per row.
    left = pa.concat_tables(left[i : i + 1] for i in range(left.num_rows))
    # Both on-key columns are already sorted.
    assert left[left.column_names[0]] == left[left.column_names[0]].sort()
    assert right[right.column_names[0]] == right[right.column_names[0]].sort()
    with pytest.raises(
        pa.ArrowInvalid,
        match="AsofJoin does not allow out-of-order on-key values",
    ):
        left.join_asof(
            right,
            on=left.column_names[0],
            by=left.column_names[1],
            right_on=right.column_names[0],
            right_by=right.column_names[1],
            tolerance=-9_223_372_036_854_775_808,
        )
```
It took a while to put together a reproducible example, and I still can't pin
down exactly what is causing the issue.