jorisvandenbossche commented on issue #41936: URL: https://github.com/apache/arrow/issues/41936#issuecomment-2160965545
@PatrikBernhard the issue here is that your example pandas DataFrame has a chunked column (because of the `concat` step), while a RecordBatch is a data structure in which each column is a single contiguous array. In pyarrow, that is exactly the difference between a `RecordBatch` and a `Table`: a RecordBatch is a collection of `Array` objects, while a Table is a collection of `ChunkedArray` objects. That is why `pa.Table.from_pandas(concat_df)` works fine.

Historically, pandas DataFrames always had columns backed by a single non-chunked array under the hood, and that is the reason `RecordBatch.from_pandas` currently does not support chunked columns. I am not entirely sure what the best solution is: keep raising the error (but maybe make it more informative, or document this behaviour better), since people might not expect a copy in this conversion step, or automatically convert the chunked array to a contiguous array.

As a comparison, directly constructing a RecordBatch from a ChunkedArray gives the same error:

```python
In [10]: arr = pa.chunked_array([pa.array([1], pa.int32()), pa.array([2], pa.int32())])

In [11]: pa.RecordBatch.from_arrays([arr], names=["col"])
...
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```
