Tom-Newton commented on issue #41469:
URL: https://github.com/apache/arrow/issues/41469#issuecomment-2088613252
Thanks for the suggestions. PySpark doesn't really support this, but I can hack it in.

> does it still reproduce after a roundtrip to Parquet?

No, after a roundtrip to Parquet the problem no longer occurs. To test, I modified this section of PySpark: https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L323-L324

<details>
<summary>Approximately normal</summary>

```python
for batch in batches:
    pyarrow_table = pa.Table.from_batches([batch])
    yield [self.arrow_to_pandas(c) for c in pyarrow_table.itercolumns()]
```

</details>

<details>
<summary>With a round trip through Parquet</summary>

```python
import tempfile

import pyarrow.parquet

for batch in batches:
    with tempfile.TemporaryFile() as temp_file:
        pyarrow_table = pa.Table.from_batches([batch])
        pyarrow.parquet.write_table(pyarrow_table, temp_file)
        read_back_pyarrow_table = pyarrow.parquet.read_table(temp_file)
        yield [self.arrow_to_pandas(c) for c in read_back_pyarrow_table.itercolumns()]
```

</details>

> Could you try creating an IPC file from the pyspark dataframe? (I don't know if pyspark provides the functionality for that) Or can you convert the pyspark dataframe to pyarrow first (not going through pandas), and then save it?

I managed to hack something in.

<details>
<summary>My hack</summary>

```python
reader = pa.ipc.open_stream(stream)

# Dump the incoming Arrow stream to a file...
with open("/tmp/arrow_stream", "wb") as write_file:
    with pa.ipc.new_stream(write_file, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)

# ...then read it back and yield the batches as before.
with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        for batch in reader:
            yield batch
```

Original is https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L108-L113

</details>

And now I have a way to reproduce with just `pyarrow`:

```python
import pyarrow as pa

batches = []
with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        schema = reader.schema
        for batch in reader:
            # record_batches_to_chunked_array is a helper from the repro script (not shown here)
            batches.extend(
                [c.to_pandas(date_as_object=True) for c in record_batches_to_chunked_array(batch)]
            )
```

[arrow_stream.txt](https://github.com/apache/arrow/files/15178355/arrow_stream.txt) (the `.txt` extension is just so GitHub lets me upload it; it's actually a binary dump of the Arrow stream created with the hack mentioned above).
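For completeness, here's a self-contained sketch of the repro that doesn't depend on any PySpark internals. It assumes the attached `arrow_stream.txt` has been downloaded to the working directory, and it substitutes a plain per-column loop for the `record_batches_to_chunked_array` helper, so treat it as an approximation of the exact code path above rather than the same code:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

# Read the dumped Arrow IPC stream back into a table.
with open("arrow_stream.txt", "rb") as f:
    with pa.ipc.open_stream(f) as reader:
        table = reader.read_all()

# Direct per-column conversion - the path where the problem shows up.
direct = [col.to_pandas(date_as_object=True) for col in table.itercolumns()]

# After a roundtrip through Parquet the problem no longer occurs.
buf = io.BytesIO()
pq.write_table(table, buf)
roundtripped = pq.read_table(buf)
after_roundtrip = [col.to_pandas(date_as_object=True) for col in roundtripped.itercolumns()]
```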
