samkumar commented on PR #29719: URL: https://github.com/apache/spark/pull/29719#issuecomment-1818420929
Have there been any updates on adding this kind of functionality since this pull request? Being able to take an RDD of pyarrow RecordBatches or pandas DataFrames and turn it into a Spark DataFrame would be very useful for turning a dataset that is distributed across workers outside of Spark into a Spark DataFrame for analysis.

Even if an API like this hasn't been added, is there any guidance on achieving this (building a Spark DataFrame from an RDD of pyarrow RecordBatches or pandas DataFrames) in Spark 3.4/3.5? As far as I can tell, the code in this pull request no longer works on the latest versions of Spark because `toDataFrame` now accepts an iterator as its argument, not an RDD.
