devin-petersohn commented on PR #53391: URL: https://github.com/apache/spark/pull/53391#issuecomment-3656731761
> ... batch is actually the same as arrow_batch right? We need to encode it to IPC bytes and then decode because `batch_to_bytes_iter` runs in a worker and `_get_arrow_array_partition_stream` runs in the driver?

Yes, generally correct. Each worker converts its data to serialized Arrow IPC bytes (done in `batch_to_bytes_iter`), which produces a `pyspark.DataFrame` with a single column of type `ByteType`. `_get_arrow_array_partition_stream` runs in the driver as a helper function because it will also be useful if we later want to add support for the `__dataframe__` protocol.
