rabernat opened a new issue, #43236:
URL: https://github.com/apache/arrow/issues/43236

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I want to filter a pyarrow table based on a fixed-length tuple of ints. 
Filtering is not supported for compound dtypes like list arrays, so I am 
converting this tuple to a custom binary encoding and filtering against that.
   
   I am experiencing a segfault which occurs only when:
   - the data are sufficiently large (in this example we have 1_000_000 rows)
   - I use a `fixed_size_binary` type rather than `binary`
   
   
   Here is an MRE:
   
   ```python
   from itertools import product
   import pyarrow.compute as pc
   import pyarrow as pa
   
   def encode_coord_as_fixed_size_binary(coord: tuple[int, ...], max_length=2) -> bytes:
       """Turn a tuple of ints into bytes"""
       out = b''
       for item in coord[::-1]:
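           # byteorder defaults to "big" only on Python >= 3.11; spell it out for older versions
           out += item.to_bytes(length=max_length, byteorder="big")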
       assert len(out) == len(coord) * max_length
       return out
   
   N = 100  # works fine for smaller data (e.g. N=10)
   nrows = N**3
   
   raw_data = list(product(range(N), repeat=3))
   encoded_data = [encode_coord_as_fixed_size_binary(coord) for coord in raw_data]
   array = pa.array(encoded_data, type=pa.binary())
   array_fixed_size = pa.array(encoded_data, pa.binary(3 * 2))
   
   table = pa.table([array, array_fixed_size], names=["binary", "fixed_size_binary"])
   
   # works fine
   table.filter(pc.field("binary") == encode_coord_as_fixed_size_binary((5, 3, 9)))
   
   # this segfaults
   table.filter(pc.field("fixed_size_binary") == encode_coord_as_fixed_size_binary((5, 3, 9)))
   ```
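
For reference, here is a minimal sketch of what the encoder produces for the coordinate used in the filter. It needs no pyarrow. The function body is a compact rewrite of the MRE's encoder (same name, explicit `byteorder="big"`), so treat it as an illustration rather than the exact code that triggers the crash.

```python
def encode_coord_as_fixed_size_binary(coord: tuple[int, ...], max_length: int = 2) -> bytes:
    """Encode each int as `max_length` big-endian bytes, last coordinate first."""
    return b"".join(
        item.to_bytes(length=max_length, byteorder="big") for item in coord[::-1]
    )

encoded = encode_coord_as_fixed_size_binary((5, 3, 9))
print(encoded)       # b'\x00\t\x00\x03\x00\x05'  (9, 3, 5 as 2-byte big-endian ints)
print(len(encoded))  # 6, matching the width 3 * 2 of the fixed-size column
```

Every encoded value is exactly 6 bytes long, so the width of the fixed-size column is consistent with the data.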
   
   A core dump is produced, but it is about 500 MB, so I'm not including it here.
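
Since the core dump is too large to attach, a lighter-weight option might be Python's standard-library `faulthandler` module, which writes the Python-level traceback when the process receives a fatal signal such as SIGSEGV. This is only a sketch: the log file name is a hypothetical choice, and the traceback shows Python frames only, not the native Arrow C++ frames.

```python
import faulthandler

# Hypothetical log file name; the file object must stay open for as long as
# the handler is enabled, so it is deliberately not closed here.
log_file = open("segfault_traceback.log", "w")

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python traceback
# of all threads into the log file.
faulthandler.enable(file=log_file, all_threads=True)

# ... run the MRE above after this point ...
```

Running the script with `python -X faulthandler` should give the same kind of traceback on stderr without any code changes.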
   
   I'm on:
   - Linux
   - Python 3.12.4
   - PyArrow '6.1.0
   
   ### Component(s)
   
   Python

