devin-petersohn commented on PR #53391:
URL: https://github.com/apache/spark/pull/53391#issuecomment-3656731761

   >... batch is actually the same as arrow_batch right? We need to encode it 
to an IPC bytes then decode because batch_to_bytes_iter runs in a worker and 
_get_arrow_array_partition_stream runs in the driver?
   
   Yes, that's generally correct. The workers serialize their data to Arrow 
IPC bytes (done in `batch_to_bytes_iter`), which produces a `pyspark.DataFrame` 
with a single column of type `ByteType`. `_get_arrow_array_partition_stream` 
runs on the driver as a helper function because it will also be useful if we 
later add support for the `__dataframe__` protocol.
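   The round trip described above can be sketched with plain `pyarrow`. This is an illustrative stand-in, not the PR's actual code: `batch_to_ipc_bytes` plays the role of the per-partition encoding done in `batch_to_bytes_iter`, and `ipc_bytes_to_batches` plays the role of the decode step inside `_get_arrow_array_partition_stream`; both helper names here are hypothetical.

   ```python
   import pyarrow as pa

   def batch_to_ipc_bytes(batch: pa.RecordBatch) -> bytes:
       # Worker side (illustrative): serialize one RecordBatch to
       # Arrow IPC stream bytes, suitable for a ByteType column.
       sink = pa.BufferOutputStream()
       with pa.ipc.new_stream(sink, batch.schema) as writer:
           writer.write_batch(batch)
       return sink.getvalue().to_pybytes()

   def ipc_bytes_to_batches(data: bytes):
       # Driver side (illustrative): decode IPC stream bytes back
       # into RecordBatches.
       with pa.ipc.open_stream(pa.BufferReader(data)) as reader:
           for batch in reader:
               yield batch

   batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})
   payload = batch_to_ipc_bytes(batch)
   decoded = list(ipc_bytes_to_batches(payload))
   assert decoded[0].equals(batch)
   ```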


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
