0x26res opened a new issue, #41351:
URL: https://github.com/apache/arrow/issues/41351
### Describe the bug, including details regarding any error messages, version, and platform.
I apologise in advance: this issue is contrived.
I start with a chunked array of lists of structs, for example
`pa.list_(pa.struct([pa.field("value", pa.float64())]))`.
I construct the chunked array in such a way that **the underlying values of
the lists are shared among the array chunks.**
```python
import pyarrow as pa
import pyarrow.compute as pc

# A single struct array backs both list chunks below.
values = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3, 4, 5, 6, 7])], ["values"]
)
my_array = pa.chunked_array(
    [
        pa.ListArray.from_arrays(
            [0, 1, 2, 3],
            values,
        ),
        pa.ListArray.from_arrays(
            [3, 4, 5, 6],
            values,
        ),
    ]
)
```
I then try to flatten/explode the list array:
```python
flatten = pc.list_flatten(my_array)
```
And create a table from the chunks of the flattened array:
```python
table = pa.Table.from_batches(
    pa.RecordBatch.from_struct_array(chunk) for chunk in flatten.iterchunks()
)
```
I then add a column to the table and call `to_batches`, which causes a
segfault:
```python
table.append_column("name", pa.repeat("foo", len(table))).to_batches()
```
One thing I've noticed is that the record batch built from the first chunk of
the `flatten` array has the wrong `str` representation. This assertion passes:
```python
assert (
    str(pa.RecordBatch.from_struct_array(flatten.chunks[0]))
    == "pyarrow.RecordBatch\nvalues: int64\n----\nvalues: [1,2,3,4,5,6,7]"
)
```
It should show `[1,2,3]`.
Full example:
```python
import pyarrow as pa
import pyarrow.compute as pc


def test_wrong():
    # One struct array is shared by both list chunks.
    values = pa.StructArray.from_arrays(
        [pa.array([1, 2, 3, 4, 5, 6, 7])], ["values"]
    )
    my_array = pa.chunked_array(
        [
            pa.ListArray.from_arrays(
                [0, 1, 2, 3],
                values,
            ),
            pa.ListArray.from_arrays(
                [3, 4, 5, 6],
                values,
            ),
        ]
    )
    flatten = pc.list_flatten(my_array)
    assert flatten.to_pylist() == [
        {"values": 1},
        {"values": 2},
        {"values": 3},
        {"values": 4},
        {"values": 5},
        {"values": 6},
    ]
    assert pa.RecordBatch.from_struct_array(flatten.chunks[0]).to_pylist() == [
        {"values": 1},
        {"values": 2},
        {"values": 3},
    ]
    # Passes, although the batch should only show [1,2,3].
    assert (
        str(pa.RecordBatch.from_struct_array(flatten.chunks[0]))
        == "pyarrow.RecordBatch\nvalues: int64\n----\nvalues: [1,2,3,4,5,6,7]"
    )
    pa.Table.from_batches([pa.RecordBatch.from_struct_array(flatten.chunks[0])])
    table = pa.Table.from_batches(
        pa.RecordBatch.from_struct_array(chunk) for chunk in flatten.iterchunks()
    )
    table = table.append_column("name", pa.repeat("foo", len(table)))
    # Segfaults here.
    table.to_batches()
```
A bit of context:
- tested with `pyarrow==16.0.0`
- this came up when exploding and filtering some data coming from parquet
- The example no longer reproduces the problem if you omit the last element
of the underlying values (`7`).
### Component(s)
Python