lccnl opened a new issue, #34639:
URL: https://github.com/apache/arrow/issues/34639
### Describe the bug, including details regarding any error messages, version, and platform.
Hello,
it seems that when a StructArray points to part of a larger array via an offset and we try to convert it to a record batch, that offset information is lost and we get a record batch whose columns have the length of the larger array.
For now, a workaround is to select the actual values with `take` before converting the array to a record batch (though this does not scale well; slicing does not work).
The following code reproduces the error and shows the workaround:
```python
import pyarrow as pa

# create a struct array and a table having the struct as its column, then split it into record batches
struct_1 = pa.StructArray.from_arrays([pa.array([1., 2.]), pa.array(['a', 'b'])], names=['col1', 'col2'])
out = pa.Table.from_arrays(arrays=[struct_1], names=['struct'])
batches = out.to_batches(max_chunksize=1)
# convert to struct arrays; here each array has an offset referencing the table
arrays = [pa.StructArray.from_arrays(batch.columns, names=batch.schema.names) for batch in batches]
# select the struct column
modified_arrays = [array.field('struct') for array in arrays]
# take the value of each array to show the difference
taken_arrays = [array.take(pa.array([0])) for array in modified_arrays]
for standard, taken in zip(modified_arrays, taken_arrays):
    # the arrays are equal
    assert standard == taken
    assert len(standard) == len(taken) == 1
    # but the record batches are different!
    assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == 2
    assert len(pa.RecordBatch.from_struct_array(taken).column(0)) == 1
```
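For reference, here is a minimal sketch of the take-based workaround wrapped as a helper to call before `RecordBatch.from_struct_array` (the function name `to_record_batch_workaround` is hypothetical, not part of pyarrow; it copies the data, which is why it does not scale well):

```python
import pyarrow as pa

def to_record_batch_workaround(struct_array: pa.StructArray) -> pa.RecordBatch:
    # Materialize the sliced values with take() so the resulting StructArray
    # no longer carries an offset into the parent buffers, then convert it.
    # This copies the data, so it is only a stopgap until the offset is honored.
    indices = pa.array(range(len(struct_array)))
    return pa.RecordBatch.from_struct_array(struct_array.take(indices))
```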
### Component(s)
Python