Tom-Newton commented on issue #41469:
URL: https://github.com/apache/arrow/issues/41469#issuecomment-2088613252

   Thanks for the suggestions. PySpark doesn't really support this, but I can
hack it into doing so.
   
   > does it still reproduce after a roundtrip to Parquet?
   
   No, after a roundtrip to Parquet the problem no longer occurs. To test, I
modified this section of pyspark:
https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L323-L324
   <details>
     <summary>Approximately normal</summary>
   
     ```python
     for batch in batches:
         pyarrow_table = pa.Table.from_batches([batch])
         yield [self.arrow_to_pandas(c) for c in pyarrow_table.itercolumns()]
     ```
   </details>
   <details>
     <summary>With a round trip through Parquet</summary>
   
     ```python
     for batch in batches:
         import tempfile
         import pyarrow.parquet

         # Round-trip the batch through a temporary Parquet file on disk
         with tempfile.TemporaryFile() as tmp_file:
             pyarrow_table = pa.Table.from_batches([batch])
             pyarrow.parquet.write_table(pyarrow_table, tmp_file)
             read_back_pyarrow_table = pyarrow.parquet.read_table(tmp_file)
         yield [self.arrow_to_pandas(c) for c in read_back_pyarrow_table.itercolumns()]
     ```
   </details>
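
   For reference, the same round trip could presumably also be done entirely in
memory with an `io.BytesIO` buffer instead of a temporary file (an untested
sketch, with the same assumptions about the surrounding pyspark code as above):
   ```python
   import io
   import pyarrow.parquet

   for batch in batches:
       pyarrow_table = pa.Table.from_batches([batch])
       # Write the table to an in-memory buffer, rewind, and read it straight back
       buffer = io.BytesIO()
       pyarrow.parquet.write_table(pyarrow_table, buffer)
       buffer.seek(0)
       read_back_pyarrow_table = pyarrow.parquet.read_table(buffer)
       yield [self.arrow_to_pandas(c) for c in read_back_pyarrow_table.itercolumns()]
   ```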
   
   
   
   > Could you try creating an IPC file from the pyspark dataframe? (I don't
   > know if pyspark provides the functionality for that) Or can you convert the
   > pyspark dataframe to pyarrow first (not going through pandas), and then save it?
   
   I managed to hack something in.
   <details>
     <summary>My hack</summary>
   
     ```python
     reader = pa.ipc.open_stream(stream)

     # Dump the incoming Arrow IPC stream to a file on disk...
     with open("/tmp/arrow_stream", "wb") as write_file:
         with pa.ipc.new_stream(write_file, reader.schema) as writer:
             for batch in reader:
                 writer.write_batch(batch)

     # ...then read it back and yield the batches as before
     with open("/tmp/arrow_stream", "rb") as read_file:
         with pa.ipc.open_stream(read_file) as reader:
             for batch in reader:
                 yield batch
     ```
     Original is
https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L108-L113
   </details>
   
   And now I have a way to reproduce with just `pyarrow`. 
    ```python
    with open("/tmp/arrow_stream", "rb") as read_file:
        with pa.ipc.open_stream(read_file) as reader:
            schema = reader.schema

            for batch in reader:
                batches.extend([c.to_pandas(date_as_object=True) for c in record_batches_to_chunked_array(batch)])
    ```
   
[arrow_stream.txt](https://github.com/apache/arrow/files/15178355/arrow_stream.txt)
  (the `.txt` extension is just so GitHub lets me upload it; it's actually a
binary dump of the Arrow stream created with the hack I mentioned above).
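
   In case it helps, a minimal standalone version of the reproduction against the
attached file (assuming it is saved locally as `arrow_stream.txt`, and using
`Table.itercolumns()` instead of my helper from the snippet above) looks roughly
like this:
   ```python
   import pyarrow as pa

   # Sketch: read the attached IPC stream dump and convert each column to
   # pandas, which is where the problem shows up for me.
   with open("arrow_stream.txt", "rb") as read_file:
       with pa.ipc.open_stream(read_file) as reader:
           for batch in reader:
               table = pa.Table.from_batches([batch])
               columns = [c.to_pandas(date_as_object=True) for c in table.itercolumns()]
   ```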
   
   

