Re: [PR] [SPARK-32846][SQL][PYTHON] Support createDataFrame from an RDD of pd.DataFrames [spark]

via GitHub Mon, 20 Nov 2023 00:09:19 -0800


samkumar commented on PR #29719:
URL: https://github.com/apache/spark/pull/29719#issuecomment-1818420929


   Have there been any updates on adding this kind of functionality since this 
pull request? Being able to take an RDD of pyarrow RecordBatches or pandas 
DataFrames and turn it into a Spark DataFrame would be very useful for 
converting a dataset that is distributed across workers outside of Spark into 
a Spark DataFrame for analysis.
   
   Even if an API like this hasn't been added, is there any guidance on 
achieving this (building a Spark DataFrame from an RDD of pyarrow RecordBatches 
or pandas DataFrames) in Spark 3.4/3.5? As far as I can tell, the code in this 
pull request no longer works on the latest versions of Spark because 
`toDataFrame` now accepts an iterator as its argument, not an RDD.
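
   For context, one possible fallback is sketched below. It is an assumption 
about what might work, not a confirmed recommendation: each pandas DataFrame is 
flattened into plain tuples with `itertuples`, and the resulting RDD of rows is 
passed to `createDataFrame` with an explicit schema. The helper names 
`pdf_to_rows` and `rdd_of_pandas_to_spark` are hypothetical. This path goes 
through Python row objects rather than Arrow, so it is likely slower than the 
approach in this pull request, and it does not address missing-value handling.

```python
def pdf_to_rows(pdf):
    """Return an iterator over the rows of one pandas DataFrame as plain tuples.

    index=False drops the pandas index; name=None yields regular tuples
    instead of namedtuples.
    """
    return pdf.itertuples(index=False, name=None)


def rdd_of_pandas_to_spark(spark, rdd, schema):
    """Hypothetical helper: build a Spark DataFrame from an RDD of pandas DataFrames.

    `spark` is an existing SparkSession, `rdd` is an RDD whose elements are
    pandas DataFrames that all share the same columns, and `schema` is a
    StructType or a DDL string whose fields match those columns in order.
    """
    # Flatten every pandas DataFrame into its rows, giving an RDD of tuples.
    rows = rdd.flatMap(pdf_to_rows)
    # An explicit schema avoids sampling the data to infer column types.
    return spark.createDataFrame(rows, schema=schema)
```

   It would be called as `rdd_of_pandas_to_spark(spark, rdd, "id long, name string")`, 
with the schema string adjusted to the actual columns.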


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


