rabernat opened a new issue, #43236:
URL: https://github.com/apache/arrow/issues/43236
### Describe the bug, including details regarding any error messages, version, and platform.
I want to filter a pyarrow table based on a fixed-length tuple of ints.
Filtering is not supported for compound dtypes like list arrays, so I am
converting this tuple to a custom binary encoding and filtering against that.
I am experiencing a segfault which occurs only when:
- the data are sufficiently large (in this example we have 1_000_000 rows)
- I use a `fixed_size_binary` type rather than `binary`
Here is an MRE:
```python
from itertools import product

import pyarrow as pa
import pyarrow.compute as pc


def encode_coord_as_fixed_size_binary(coord: tuple[int, ...], max_length=2) -> bytes:
    """Turn a tuple of ints into bytes"""
    out = b''
    for item in coord[::-1]:
        out += item.to_bytes(length=max_length)
    assert len(out) == len(coord) * max_length
    return out


N = 100  # works fine for smaller data (e.g. N=10)
nrows = N**3
raw_data = list(product(range(N), repeat=3))
encoded_data = [encode_coord_as_fixed_size_binary(coord) for coord in raw_data]

array = pa.array(encoded_data, type=pa.binary())
array_fixed_size = pa.array(encoded_data, pa.binary(3 * 2))
table = pa.table([array, array_fixed_size], names=["binary", "fixed_size_binary"])

# works fine
table.filter(pc.field("binary") == encode_coord_as_fixed_size_binary((5, 3, 9)))

# this segfaults
table.filter(pc.field("fixed_size_binary") == encode_coord_as_fixed_size_binary((5, 3, 9)))
```
A core dump was produced, but it is about 500 MB, so I'm not including it here.
I'm on:
- Linux
- Python 3.12.4
- PyArrow 16.1.0
### Component(s)
Python