linar-jether commented on a change in pull request #29719:
URL: https://github.com/apache/spark/pull/29719#discussion_r683751316
##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -297,8 +297,11 @@ class SparkConversionMixin(object):
"""
Mix-in for the conversion from pandas to Spark. Currently, only
:class:`SparkSession` can use this class.
+ pandasRDD=True creates a DataFrame from an RDD of pandas DataFrames
+ (currently only supported using Arrow).
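For context, a minimal usage sketch of the flag described above (assuming the `createDataFrame` signature proposed in this PR and an existing `spark` session; the data and the `pdf_rdd` name are illustrative):
```python
import pandas as pd

# Hypothetical usage of the proposed pandasRDD flag: an RDD whose
# elements are pandas DataFrames, converted in one call (requires Arrow).
pdf_rdd = spark.sparkContext.parallelize(
    [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})])
df = spark.createDataFrame(pdf_rdd, schema="a long", pandasRDD=True)
```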
Review comment:
Well, in case the user specifies a schema, the entire process is lazy, so
there's no need to evaluate any of the RDD's elements. Even if we keep
everything lazy and map each element to either a Row or a RecordBatch, we
would still need to know which path to take. For RecordBatches we need to
call:
```python
from pyspark.sql.dataframe import DataFrame

# rb_rdd is an RDD whose elements are pyarrow.RecordBatch objects
jrdd = rb_rdd._to_java_object_rdd()
jdf = self._jvm.PythonSQLUtils.toDataFrame(
    jrdd, schema.json(), self._wrapped._jsqlContext)
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
```
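(The snippet above assumes `rb_rdd` is an RDD of Arrow RecordBatches. As a rough sketch, such an RDD could be derived lazily from an RDD of pandas DataFrames, here called `pandas_rdd`, a hypothetical name:)
```python
import pyarrow as pa

# Hypothetical: lazily map each pandas DataFrame to an Arrow RecordBatch;
# nothing is evaluated until a Spark action runs.
rb_rdd = pandas_rdd.map(
    lambda pdf: pa.RecordBatch.from_pandas(pdf, preserve_index=False))
```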
For Rows, on the other hand, we need to call:
```python
# rdd is an RDD whose elements are pyspark.sql.Row objects
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
df = DataFrame(jdf, self._wrapped)
df._schema = schema
return df
```
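Either way, both branches stay lazy. A rough sketch of a single dispatcher wrapping the two paths above (the method name and the `arrow_enabled` flag are hypothetical):
```python
def _create_from_element_rdd(self, rdd, schema, arrow_enabled):
    # Dispatch between the two lazy paths; no RDD element is evaluated
    # here because the user already supplied a schema.
    from pyspark.sql.dataframe import DataFrame
    if arrow_enabled:
        # rdd elements are pyarrow.RecordBatch objects
        jrdd = rdd._to_java_object_rdd()
        jdf = self._jvm.PythonSQLUtils.toDataFrame(
            jrdd, schema.json(), self._wrapped._jsqlContext)
    else:
        # rdd elements are pyspark.sql.Row objects
        jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
        jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    df = DataFrame(jdf, self._wrapped)
    df._schema = schema
    return df
```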