linar-jether commented on a change in pull request #29719:
URL: https://github.com/apache/spark/pull/29719#discussion_r683751316
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -297,8 +297,11 @@ class SparkConversionMixin(object):
"""
Mix-in for the conversion from pandas to Spark. Currently, only
:class:`SparkSession` can use this class.
+ pandasRDD=True creates a DataFrame from an RDD of pandas DataFrames
+ (currently only supported using Arrow).
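For context, a minimal usage sketch of the flag described above (assuming the `createDataFrame` signature proposed in this PR and an existing `spark` session; the data and the `pdf_rdd` name are illustrative):
```python
import pandas as pd

# Hypothetical usage of the proposed pandasRDD flag: an RDD whose
# elements are pandas DataFrames, converted in one call (requires Arrow).
pdf_rdd = spark.sparkContext.parallelize(
    [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})])
df = spark.createDataFrame(pdf_rdd, schema="a long", pandasRDD=True)
```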
Review comment:
Well, in case the user specifies a schema, the entire process is lazy, so
there's no need to evaluate any of the RDD's elements. Even if we keep
everything lazy and map each element to either a Row or a RecordBatch, we
would still need to know which path to take. For RecordBatches we need to
call:
```python
from pyspark.sql.dataframe import DataFrame

# rb_rdd is an RDD whose elements are pyarrow.RecordBatch objects
jrdd = rb_rdd._to_java_object_rdd()
jdf = self._jvm.PythonSQLUtils.toDataFrame(
    jrdd, schema.json(), self._wrapped._jsqlContext)
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
```
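(The snippet above assumes `rb_rdd` is an RDD of Arrow RecordBatches. As a rough sketch, such an RDD could be derived lazily from an RDD of pandas DataFrames, here called `pandas_rdd`, a hypothetical name:)
```python
import pyarrow as pa

# Hypothetical: lazily map each pandas DataFrame to an Arrow RecordBatch;
# nothing is evaluated until a Spark action runs.
rb_rdd = pandas_rdd.map(
    lambda pdf: pa.RecordBatch.from_pandas(pdf, preserve_index=False))
```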
For Rows, on the other hand, we need to call:
```python
# rdd is an RDD whose elements are pyspark.sql.Row objects
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
```
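Either way, both branches stay lazy. A rough sketch of a single dispatcher wrapping the two paths above (the method name and the `arrow_enabled` flag are hypothetical):
```python
def _create_from_element_rdd(self, rdd, schema, arrow_enabled):
    # Dispatch between the two lazy paths; no RDD element is evaluated
    # here because the user already supplied a schema.
    from pyspark.sql.dataframe import DataFrame
    if arrow_enabled:
        # rdd elements are pyarrow.RecordBatch objects
        jrdd = rdd._to_java_object_rdd()
        jdf = self._jvm.PythonSQLUtils.toDataFrame(
            jrdd, schema.json(), self._wrapped._jsqlContext)
    else:
        # rdd elements are pyspark.sql.Row objects
        jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
        jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    df = DataFrame(jdf, self._wrapped)
    df._schema = schema
    return df
```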