lccnl opened a new issue, #34639:
URL: https://github.com/apache/arrow/issues/34639

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello,
   It seems that when a struct array points to part of a larger array via an
   offset and we convert it to a record batch, the offset is lost: the columns
   of the resulting record batch have the length of the larger array.
   
   For now, a workaround is to select all the actual values in the array (for
   example with `take`) before converting it to a record batch. This solution
   does not scale well, and slicing does not work as an alternative.
   
   The following code reproduces the error and shows the workaround: 
   ```
   import pyarrow as pa

   # Create a struct array, wrap it in a table as a single column,
   # then split the table into record batches of one row each.
   struct_1 = pa.StructArray.from_arrays(
       [pa.array([1., 2.]), pa.array(['a', 'b'])], names=['col1', 'col2'])
   out = pa.Table.from_arrays(arrays=[struct_1], names=['struct'])
   batches = out.to_batches(max_chunksize=1)

   # Convert each batch to a struct array; each array keeps an offset
   # referencing the table's data.
   arrays = [pa.StructArray.from_arrays(batch.columns, names=batch.schema.names)
             for batch in batches]
   # Select the inner struct.
   modified_arrays = [array.field('struct') for array in arrays]

   # Workaround: take all the elements of each array (here just index 0),
   # which materializes the selected values.
   taken_arrays = [array.take(pa.array([0])) for array in modified_arrays]

   for standard, taken in zip(modified_arrays, taken_arrays):
       # The arrays are equal...
       assert standard == taken
       assert len(standard) == len(taken) == 1
       # ...but the record batches are different!
       assert len(pa.RecordBatch.from_struct_array(standard).column(0)) == 2
       assert len(pa.RecordBatch.from_struct_array(taken).column(0)) == 1
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
